NEW SONORITIES FOR EARLY JAZZ RECORDINGS USING SOUND SOURCE SEPARATION AND AUTOMATIC MIXING TOOLS

Daniel Matz, University of Applied Sciences Düsseldorf, Germany, daniel.matz@hotmail.de
Estefanía Cano, Fraunhofer IDMT, Ilmenau, Germany, cano@idmt.fhg.de

ABSTRACT

In this paper, a framework for the automatic mixing of early jazz recordings is presented. In particular, we propose the use of sound source separation techniques as a preprocessing step of the mixing process. In addition to an initial solo and accompaniment separation step, the proposed mixing framework is composed of six processing blocks: harmonic-percussive separation (HPS), cross-adaptive multi-track scaling (CAMTS), cross-adaptive equalizer (CAEQ), cross-adaptive dynamic spectral panning (CADSP), automatic excitation (AE), and time-frequency selective panning (TFSP). The effect of each processing step on the final quality of the mix is evaluated through a listening test procedure. The results show that the desired quality improvements in terms of sound balance, transparency, stereo impression, timbre, and overall impression can be achieved with the proposed framework.

1. INTRODUCTION

When early jazz recordings are analyzed from a modern audio engineering perspective, clear stylistic differences can be identified with respect to modern recording techniques. These characteristics mainly reflect the technological and stylistic differences between the two eras. For example, solo instruments such as the saxophone or the trumpet often completely dominate the audio mix in early jazz recordings. At the same time, the rhythm section, i.e., double bass, piano, drums, and percussion, often takes a secondary place: it is recorded or mixed at a much lower level and is often perceived as unclear and undifferentiated. Additionally, from today's perspective, early jazz recordings often present an unusual stereo image.
Instrument groups are sometimes assigned to extreme stereo positions, which can result in the solo instrument being panned to the left and the accompaniment band panned to the right. As a consequence, the energy distribution over the stereo width is unbalanced and is today often perceived as irritating and disturbing, especially when listening through headphones.

© Daniel Matz, Estefanía Cano, Jakob Abeßer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Daniel Matz, Estefanía Cano, Jakob Abeßer. "New sonorities for early jazz recordings using sound source separation and automatic mixing tools", 16th International Society for Music Information Retrieval Conference, 2015.

Jakob Abeßer, Fraunhofer IDMT, Ilmenau, Germany, abr@idmt.fhg.de

Several initiatives have arisen that attempt to give such early recordings a more modern sonority. Remastering and Automatic Mixing (AM) techniques offer various methods for a sonic redesign of such recordings. However, given that the original individual stems of the instruments in the recordings are usually not available, these techniques can only achieve minor modifications of the sound characteristics of mono and stereo mixtures. In-depth remixing usually requires the original multi-track recordings to be available. For this purpose, sound source separation methods developed in the Music Information Retrieval (MIR) community can be useful tools to retrieve individual instruments from a given mix.

2. GOALS

The main goal of this study is to identify suitable signal processing methods to modify the above-mentioned characteristics in a selection of early jazz recordings. These methods are combined in a fully automatic mixing framework. In particular, we focus on modifying the audio mix in terms of transparency, stereo impression, frequency response, and acoustic balance in order to improve the overall perception of sound and the quality of the mix with respect to the original recording.
Our main approach for remixing is to modify the characteristics of the backing track to make it more present in the mix. We also aim at improving the acoustical and spatial balance of the audio mix. The solo signal is balanced with respect to its loudness and spectral energy to minimize spectral masking, as well as to improve its position in the stereo image.

3. RELATED WORK

In the field of automatic mixing, several approaches have been presented in the literature. In [1], a method is proposed to automatically adjust gain and equalizer parameters for multi-track recordings using a least-squares optimization. In [12], the idea of modifying the magnitude spectrogram of a signal towards a target spectrogram, called target mixing, is presented. Other approaches for the automatic mixing of multi-track recordings have incorporated machine learning algorithms to perform the mixing process [16, 17].

Proceedings of the 16th ISMIR Conference, Málaga, Spain, October 26-30, 2015
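To make the target-mixing idea of [12] concrete, the following sketch scales each bin of a signal's complex spectrogram so that its magnitude moves toward a given target magnitude while the original phase is preserved. This is a minimal illustration under our own simplifications, not the actual method of [12]; the function name `target_mix` and the interpolation parameter `alpha` are our own.

```python
import numpy as np

def target_mix(stft_in, target_mag, alpha=1.0, eps=1e-12):
    """Move the magnitude spectrogram of `stft_in` toward `target_mag`.

    alpha = 0 leaves the input untouched; alpha = 1 imposes the target
    magnitudes exactly. The original phase is kept in both cases.
    """
    mag = np.abs(stft_in)
    phase = stft_in / (mag + eps)          # unit-magnitude phase factors
    new_mag = (1.0 - alpha) * mag + alpha * target_mag
    return new_mag * phase

# toy example: a single STFT frame with three frequency bins
frame = np.array([1.0 + 0.0j, 0.0 + 2.0j, -3.0 + 0.0j])
target = np.array([2.0, 1.0, 1.0])
out = target_mix(frame, target, alpha=1.0)
```

In practice the same operation would be applied frame by frame to the STFT of the full mix, followed by an inverse STFT.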
4 Design and Implementation (Konzeption und Umsetzung)

[Abbildung 4.4: Block diagram of the feature extraction of the implemented cross-adaptive equalizer, after Figure 13.11, "Block diagram of an automatic equaliser": the input signals x_e(n) and x_m(n) pass through a filter bank with K bands, a multi-band adaptive gate, a probability mass function, and a fixed-SPL measurement with a table of loudness weighting coefficients (the same for all K filters), followed by cross-adaptive feature processing.]

The feature extraction block has a function that outputs the most probable loudness per band of a given input, denoted by fv_{k,m}(n). The block diagram of this multi-band loudness feature extraction method is shown in Figure 13.11.

Cross-adaptive feature processing consists of mapping the perceptual loudness of each spectral band to its amplitude level so that, by manipulating its gain per band, we can achieve a desired target loudness level. We aim to achieve an average loudness value l(n); therefore, we must decrease the loudness of the equalisation bands for signals above this average and increase the band gain for signals below this average. This results in a system in which we have a multiplier cv_{k,m}(n) per band, such that we scale the input bands x_{k,m}(n) in order to achieve a desired average loudness l(n). The function of the system can be approximated by Hl_{k,m}(n) = l(n) / (cv_{k,m}(n) x_{k,m}(n)), where the control vector cv_{k,m}(n) is given by

    cv_{k,m}(n) = \frac{l(n)}{Hl_{k,m}(n)\, x_{k,m}(n)},    (13.18)

where cv_{k,m}(n) is the control gain variable per band used to achieve the target average loudness l(n), and Hl_{k,m}(n) is the function of the desired system.

Within the cross-adaptive feature processing block (CAFPB), the most frequent loudness fv_{k,m}(n) of each frequency band is now mapped to a gain factor cv_{k,m}(n) of the corresponding EQ band, for all tracks. In this way, the gain of a specific band can be adjusted so that it ultimately has the intended loudness. This follows the cross-adaptive processing of the CAMTS, where a mean target loudness l(n) is likewise first determined by averaging over all fv_{k,m}(n) for all K frequency bands and M signals:

    l(n) = \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} \frac{fv_{k,m}(n)}{K\, M}.    (4.11)
3. RELATED WORK

In [14] and [19], several cross-adaptive signal processing methods for automatic mixing such as source enhancer, panner, fader, equalizer, and polarity and time offset correction are proposed. These modules can be combined into a full mixing application. In [4], the authors propose a knowledge-engineered autonomous mixing system that incorporates expert knowledge within an automatic mixing system. The included audio effects are automatically controlled based on extracted low-level and high-level features such as musical genre, instrumentation, and the type of sound sources. The authors evaluated the system using short four-bar audio signals with vocals, bass, guitar, keyboard, and other instruments. Harmonic-percussive source separation was used as a preprocessing step for manual remixing in [6], in particular to adjust the sound source levels of the signals. To the authors' best knowledge, a framework for automatic remixing that suits the requirements discussed in section 2 has not been proposed so far.

4. PROPOSED METHOD

For our mixing framework, we propose the use of sound source separation techniques as a pre-processing step of the mixing process. For this purpose, we first isolate the solo instrument from the audio mix by applying an algorithm for pitch-informed solo and accompaniment separation [2]. The two separated signals, i.e., the solo and the residual/backing signal, are the starting point for the automatic remixing process. Additionally, based on the requirements discussed in section 2, our proposed framework comprises six subcomponents:

1. Harmonic-percussive separation (HPS)
2. Cross-adaptive multi-track scaling (CAMTS)
3. Cross-adaptive equalization (CAEQ)
4. Cross-adaptive dynamic spectral panning (CADSP)
5. Automatic excitation (AE)
6. Time-frequency selective panning (TFSP)

Figure 1 presents a block diagram of the proposed framework. There are three main signal pathways: A, B, and C. If the CADSP is activated, pathway A is chosen. If CADSP is not activated, pathway B or C is chosen depending on whether the harmonic-percussive separation (HPS) is used. All signal paths output a stereo mix. In the following sections, the individual subcomponents are first described, followed by a description of the three proposed signal pathways.

Figure 1: Signal flow-chart of the developed automatic remixing framework.

4.1 Solo and Backing Track Separation

The algorithm proposed in [2] automatically extracts pitch sequences of the solo instrument and uses them as prior information in the separation scheme. In order to obtain more accurate spectral estimates of the solo instrument, the algorithm creates tone objects from the pitch sequences and performs separation on a tone-by-tone basis. Tone segmentation allows more accurate modeling of the temporal evolution of the spectral parameters of the solo instrument. The algorithm performs an iterative search in the magnitude spectrogram in order to find the exact frequency locations of the different partials of each tone. A smoothness constraint is enforced on the temporal envelopes of each partial. In order to reduce interference from other sources caused by overlapping spectral components in the time-frequency representation, a common amplitude modulation is required for the temporal envelopes of the partials. Additionally, a post-processing stage based on median filtering is used to reduce the interference from percussive instruments in the solo estimate.

4.2 Harmonic-percussive Separation (HPS)

We use the algorithm for harmonic-percussive separation proposed in [6]. The algorithm is based on median filtering of the magnitude spectrogram to split the original audio signal into its horizontal (harmonic sources) and vertical (percussive sources) elements. In an automatic mixing context, these components can be understood as separate subgroups which can be processed individually and finally remixed.

Proceedings of the 16th ISMIR Conference, Málaga, Spain, October 26-30, 2015

4.3 Cross-adaptive Multi-track Scaling (CAMTS)

The method proposed in [19], commonly referred to as automatic fader control, is used for automatic scaling of the sound sources. The algorithm automatically modifies the amplification of the separate sound sources. A psychoacoustic model based on the EBU R 128 standard [9] is used to compute the loudness of each track using a histogram-based approach. All tracks are individually amplified so that they are perceived as equally loud.
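The CAMTS idea can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the histogram-based EBU R 128 loudness model of [19] is replaced here by a plain RMS estimate, and the function name and parameters are placeholders.

```python
import numpy as np

def camts_gains(tracks, eps=1e-12):
    """Cross-adaptive multi-track scaling sketch: derive one gain per track
    so that all tracks end up at a common (average) loudness.
    RMS is used as a crude stand-in for an EBU R 128 loudness model."""
    loudness = np.array([np.sqrt(np.mean(np.square(t))) for t in tracks])
    target = loudness.mean()              # common loudness target
    gains = target / (loudness + eps)     # per-track amplification factors
    scaled = [g * t for g, t in zip(gains, tracks)]
    return scaled, gains
```

After scaling, every track has the same RMS level; the real system derives its gains from perceived loudness instead of raw RMS.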
4.4 Cross-adaptive Equalizer (CAEQ)

We use the cross-adaptive equalizing algorithm proposed in [19] to obtain a spectrally balanced mixture. The main approach is to modify the spectral envelopes of the audio signals and to minimize the spectral masking between the solo signal and the backing track. The algorithm is a multi-band extension of the CAMTS algorithm discussed in section 4.3. The spectral characteristics of the separated signals are modified by enhancing or attenuating pre-defined frequency bands depending on the signal's perceived loudness with respect to the overall loudness. In contrast to the CAMTS algorithm, the loudness model proposed in [19] is used since it outperformed the loudness model based on EBU R 128 during informal testing. In particular, the mix results based on EBU R 128 showed too strong an emphasis on treble frequencies while lacking energy in the lower frequency range. We use a 10-band octave equalizer with second-order biquad IIR filters following [19] and frequency bands uniformly distributed over the audible frequency range. Standard frequency values based on [8] are used to adjust the center frequencies of the peak filters as well as the cutoff frequencies of the shelving filters.

4.5 Cross-adaptive Dynamic Spectral Panning (CADSP)

Dynamic spectral panning is a technique that allows the creation of a stereophonic impression in a given monophonic multi-track recording. We use the algorithm proposed in [15] to create a spatialization effect given multi-track signals. The method dynamically assigns time-frequency bins of the original tracks to azimuth positions. The assignment reduces masking due to shared azimuth positions between multiple sound sources. This improves the overall transparency of an audio mix. In the cases where the original audio mix is a stereo track, it is first down-mixed to mono and then up-mixed to a new stereo image using the CADSP algorithm.
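To make the spectral panning idea concrete, the following toy sketch pans each time-frequency bin of a mono sum toward whichever of two separated sources dominates it, using constant-power panning. This is only a rough approximation of the method in [15]: the dominance-to-azimuth mapping, the STFT settings, and all names are illustrative assumptions, not the published algorithm.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_pan(solo, backing, fs=44100, nperseg=1024):
    """Pan every STFT bin of the mono sum toward whichever source
    dominates it (constant-power panning, azimuths of +/-45 degrees)."""
    _, _, S = stft(solo, fs, nperseg=nperseg)
    _, _, B = stft(backing, fs, nperseg=nperseg)
    mix = S + B
    # per-bin dominance in [0, 1]: 1 means the solo owns the bin
    dom = np.abs(S) / (np.abs(S) + np.abs(B) + 1e-12)
    # map dominance to a pan angle: backing to the left, solo to the right
    theta = (dom - 0.5) * (np.pi / 2)
    left = np.cos(theta + np.pi / 4) * mix
    right = np.sin(theta + np.pi / 4) * mix
    _, l = istft(left, fs, nperseg=nperseg)
    _, r = istft(right, fs, nperseg=nperseg)
    return np.stack([l, r])
```

Because cos² + sin² = 1, the per-bin energy of the mix is preserved while bins with different dominant sources are pushed to different azimuths, which is the masking-reduction effect described above.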
4.6 Automatic Exciter (AE)

The exciter algorithm improves the assertiveness of the backing track. The digital signal processing methods are implemented following the APHEX Aural Exciter described in [18]. The audibility of the mixed signal is enhanced by adding harmonic distortions in the upper frequency range. These distortions create additional harmonic signal components which improve the presence, clarity, and brightness of the audio signal. The automation of the excitation step is implemented following a target mixing approach. Based on [5], the mixing parameters are iteratively adjusted to a target energy ratio. The target energy ratio is computed from the relationship between the energy of the high-pass filtered signal and the energy of the target signal. In the side chain, an asymmetric soft-clipping characteristic (harmonic generator block) with an adaptive threshold is used. This allows a level-independent distortion as well as the preservation of the signal dynamics [5].

4.7 Time-frequency Selective Panning (TFSP)

Time-frequency selective panning improves the stereo image as well as the overall spatial impression of an audio mix. In our framework, the method for time-frequency selective panning presented in [3] is used. The azimuth positions of the sound sources are modified using a non-linear warping function. The stereo image is widened while the initial arrangement of the sound sources, as well as the sound quality of the original sources, is maintained. Within the proposed automatic remixing framework, the TFSP algorithm can be interpreted as an extension of the CADSP algorithm. The panning algorithm is only applied to the residual signal (see section 4.8.1). We set the aperture parameter ρ to a fixed value based on initial informal testing.

4.8 Processing Pathways

4.8.1 Signal path A (Main Path)

The main processing path includes all system components.
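The exciter principle of section 4.6 can be sketched as follows: high-pass the signal, generate harmonics with an asymmetric soft clipper, and blend them back in. The cutoff, drive, and blend values are illustrative placeholders, and the adaptive threshold and target-ratio automation of [5] are omitted.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def excite(x, fs, cutoff=3000.0, drive=4.0, amount=0.2):
    """Minimal aural-exciter sketch: high-pass the signal, distort the
    band with an asymmetric soft clipper, and mix the harmonics back in."""
    sos = butter(4, cutoff, btype='highpass', fs=fs, output='sos')
    hp = sosfilt(sos, x)
    # asymmetric clipping (different drive per half-wave) also generates
    # even harmonics, which brighten the signal
    harm = np.where(hp >= 0.0, np.tanh(drive * hp), np.tanh(0.5 * drive * hp))
    return x + amount * harm
```

Content below the cutoff passes through nearly unchanged, while content above it gains additional harmonic partials.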
Stereo files must be down-mixed to mono first due to constraints of the cross-adaptive dynamic spectral panning (CADSP) algorithm, as detailed in section 4.5. All sound sources, which are initially distributed in the stereo panorama, are first centered to the mono channel and later redistributed over the stereo panorama based on the harmonic-percussive sound separation [6]. This up-mixing step, which can involve a modification of the stereo arrangement, is only possible in this signal path. The cross-adaptive equalization (CAEQ) and multi-track scaling (CAMTS) are the first processing steps in all three pathways. After applying the dynamic spectral panning (CADSP) to the percussive and harmonic signal components, all stereophonic signals are summed up to a backing track with a more homogeneous distribution of the sound sources. The backing track can now be processed with the automatic excitation (AE) and the time-frequency selective panning (TFSP) algorithms. The solo signal is split into stereo channels in the Stereo Split stage and scaled such that the overall gain remains constant. In the final mixdown step, the backing track is mixed with the solo track by adjusting the individual amplification factors as given by the CAMTS stage. If the cross-adaptive equalization (CAEQ) was performed, the spectral envelope of the backing track is perceivably modified due to the minimization of the spectral masking. The stereo sum signal is finally normalized.

4.8.2 Signal path B

Signal path B resembles signal path A; however, the equalization (CAEQ) and scaling (CAMTS) steps offer more ways to modify parameters due to the prior harmonic-percussive separation stage.

4.8.3 Signal path C

In signal path C, no harmonic-percussive separation is performed.
The equalization (CAEQ) and scaling (CAMTS) are applied to both the backing and the solo track. However, the automatic excitation is only applied to the backing track since we particularly want to enhance the presence, clarity, and brightness of the backing track. As shown in Figure 1, the time-frequency selective panning (TFSP) can only be applied to the backing track if it is a stereo signal. For monaural signals, the signal is split to the stereo channels (Stereo Split) and scaled such that the overall gain remains constant. Similar to signal path B, the signals are finally mixed down and normalized.

5. EVALUATION

5.1 Experimental Design

To evaluate the proposed framework, a listening test was conducted following the guidelines of the Multi Stimulus Test with Hidden Reference and Anchor (MUSHRA) described in the ITU-R BS.1534-2 recommendation [11], modified to fit the characteristics of this study. The main difference of our test with respect to the original MUSHRA is that a reference signal, which in our case would be an ideal mix of the original recording, is not available. Moreover, the notion of an ideal mix is ill-posed in the automatic remixing context. The listening test was conducted in a quiet room and all signals were played using open headphones (AKG K 701). A total of 19 participants took part in the listening test. The participants included audio signal processing experts, professional audio engineers, music students (jazz, classical music), musicologists, as well as amateur musicians and regular music consumers. The average age of the participants was 30.7 years. Further demographic information such as gender, hearing impairments, listening test experience, and educational background was also collected. A summary of the demographic information is presented in Table 1. The listening test was divided into five evaluation tasks, each focusing on a different subjective quality parameter.
The following parameters were selected based on the ITU-R BS.1248-1 recommendation [10] and adapted to our requirements: (QP1) Sound Balance, (QP2) Transparency, (QP3) Stereo/Spatial Impression, (QP4) Timbre, and (QP5) Overall Impression. In each evaluation task, a training phase was first conducted to allow the participants to familiarize themselves with the test material and to adjust playback levels to a comfortable level.

Gender                            M 16 / F 3
Hearing impairment?               Yes 0 / No 19
Listening test experience?        Yes 9 / No 10
Educational background in music?  Yes 11 / No 8
Expert in audio engineering?      Yes 15 / No 4

Table 1: Demographics of the listening test participants

Following the training phase, an evaluation phase was conducted for each task. Five audio tracks as described in Table 2, with ten mixtures each, were rated by the participants. The five tracks used in this study are part of the Jazzomat Database¹. Among the presented mixtures, the original signal, eight mixes created with different configurations of the proposed framework, and an anchor signal (rhythm section reduced by 6 dB, the sum signal low-pass filtered at 3.5 kHz) were used. Table 3 gives an overview of all the remix configurations.

Title               Soloist (Instrument)   Style     Year
Body and Soul       Chu Berry (ts)         Swing     1938
Tenor Madness       Sonny Rollins (ts)     Hardbop   1956
Crazy Rhythm        J.J. Johnson (tb)      Bebop     1957
Bye Bye Blackbird   Ben Webster (ts)       Swing     1959
Adam's Apple        Wayne Shorter (ts)     Postbop   1966

Table 2: Dataset description

Mix     1    2    3    4    5    6    7    8 (mono)
HPS     off  off  off  on   on   on   on   on
CAEQ    on   off  on   on   off  off  on   on
CAMTS   off  on   on   off  on   off  on   on
CADSP   off  off  off  off  off  on   on   off
AE      on   on   on   on   on   on   on   on
TFSP    off  off  off  off  off  on   on   off

Table 3: Configurations of the eight remixes used in the listening test

The automatic excitation (AE) component is active in all the mixes. The panning (TFSP) algorithm is only activated in conjunction with the cross-adaptive dynamic spectral panning (CADSP).
This way, a further stereo expansion of critical stereo recordings with an unbalanced stereo panorama is avoided. Mixture 8 was added to investigate the influence of the stereo effects (CADSP and TFSP) on the input signals of pathway B, which are mixed monophonically in the pre-processing step.

¹ A description of the Jazzomat Database is available at: http://jazzomat.hfm-weimar.de/dbformat/dbcontent.html

Figure 2: Listening test results for the five evaluated parameters (Sound Balance, Transparency, Stereo/Spatial Impression, Timbre, and Overall Impression), shown per soloist and averaged over all tracks on a rating scale from Bad to Excellent.

5.2 Results

5.2.1 General

Figure 2 shows the results of the listening test for the five acoustic quality parameters. The figure legend summarizes all the system configurations that were evaluated. It is evident from the plot that the anchor stimulus was always correctly identified. Results also suggest that the use of harmonic-percussive separation does not bring perceptual quality gains for HPS+CAEQ (mix 4) when compared to CAEQ (mix 3). Unexpectedly, results even got worse for the parameters timbre and overall impression. Similarly, the combined settings in HPS+CAMTS (mix 5) do not show an improvement in the ratings when compared to CAMTS (mix 2). To facilitate the analysis of the results, Table 4 lists the percentage improvement obtained for each of the five quality parameters (QP), subject to the presence or absence of the individual framework components, compared to the original signal. Mixes 4 and 5, which include the harmonic-percussive separation, were not listed due to the reasons previously described.
The five mixtures listed in the table are further analyzed in the following sections.

       Mix 1    Mix 2     Mix 3           Mix 6               Mix 7
       (CAEQ)   (CAMTS)   (CAEQ+CAMTS)    (HPS+CADSP+TFSP)    (all components)
QP1    18 %     15 %      10 %             9 %                29 %
QP2    17 %     19 %      12 %            16 %                24 %
QP3     4 %     10 %       6 %            18 %                43 %
QP4    –        16 %       3 %
QP5    –         9 %       4 %             8 %                 6 %

Table 4: Percentage improvement of the remixed signals compared to the original audio recording, subject to the presence (or absence) of the individual framework components, shown for each of the five perceptual quality parameters.

5.2.2 Mix 1 (CAEQ)

Mix 1 does not include a prior separation of the residual component and outperforms the original mix for most of the quality parameters. The highest improvements were 18 % for sound balance and 17 % for transparency. However, for timbre and overall impression, no improvement was observed.

5.2.3 Mix 2 (CAMTS)

Despite the absence of the harmonic-percussive separation step, mix 2 showed improvements for transparency (19 %), sound balance (15 %), and overall impression (9 %). The reason for the improvement in timbre by 16 % is not entirely clear in this case; however, a possible explanation is that the increased loudness of the rhythm section led to more balanced dynamic levels and a clearer perception of the instrument and overall timbres.

5.2.4 Mix 3 (CAEQ+CAMTS)

The combination of the CAEQ and CAMTS components showed inferior results compared to the exclusive application of either component. However, the ratings are still slightly higher than those of the original audio file.

5.2.5 Mix 6 and Mix 7

Both mixtures 6 and 7 outperformed the original audio file. The highest ratings were achieved with mixture 7, which was created with the full processing chain.
In particular, the improvements compared to the original audio file were 29 % for sound balance, 24 % for transparency, and 43 % for stereo and spatial impression. The small improvement with respect to the overall impression is likely due to the individual aesthetic preferences of the listening test participants.

Additionally, to analyze the influence of the stereo effects on the input signals of pathway B (which are initially downmixed to mono), Table 5 presents the percentage improvement obtained with mix 7 (all components active) in comparison to mix 8 (mono).

QP1     QP2     QP3     QP4     QP5
39 %    16 %    33 %    12 %    18 %

Table 5: Mean improvement of the five quality parameters for the additional usage of the stereo effects (CADSP+TFSP) in mix 7, compared with the non-processed monophonic input signal in the otherwise identical framework setting of mix 8 (HPS, CAEQ, CAMTS, AE).

As can be observed in the table, the use of the CADSP and TFSP modules improved the ratings for all five quality parameters. The improvement was statistically significant for sound balance (39 %) and stereo/spatial impression (33 %).

6. CONCLUSIONS

In this paper, we proposed a prototype implementation of an automatic remixing framework for the tonal optimization of early jazz recordings. The main focus was on improving the balance between the solo instrument and the rhythm section. The framework consists of six components which include different processing steps to modify the loudness, frequency response, timbre, and stereophonic perception of the separated sound sources. We compared different configurations of the framework and evaluated the improvement of the transparency of the backing track as well as the acoustic balance, stereophonic homogeneity, and overall quality perception. The evaluation was performed with a MUSHRA-like listening test based on the ratings given by 19 participants.
The usage of automatic equalization (CAEQ) and multi-track scaling (CAMTS) showed clear improvements in the quality parameter ratings, whereas the combination of both led to a smaller improvement than the independent application of each approach. The improvement contributed by harmonic-percussive separation (HPS) within the automatic mixing framework is harder to assess. The usage of HPS in conjunction with CAEQ and CAMTS did not improve the ratings. On the other hand, HPS is a basic requirement for the application of CADSP to the backing track of mix 7, and therefore contributes to its consistently high ratings. HPS is irrelevant for the automatic excitation (AE) step, since AE is applied to the full residual track. Particularly with mix 7 (all components), the initially targeted improvements in sound balance, stereo and spatial impression, and transparency with respect to the original audio recording were achieved.

In future work, the most relevant processing modules must be further investigated and improved with respect to the aforementioned quality parameters. Modules that showed no or only minor improvements must be replaced, and alternative algorithms must be evaluated for the given tasks. Promising algorithms seem to be a mastering equalizer [7] or dynamic range compression [13]. The additional use of semantic information on music genre and instrumentation seems to be another fruitful approach, as discussed in Section 3. Finally, the integration of audio restoration methods such as denoising will likely help to remove unwanted background noise and spurious signals from the main signal to be processed.

7. REFERENCES

[1] Daniele Barchiesi and Joshua D. Reiss. Automatic target mixing using least-square optimization of gains and equalization settings. In Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09), Como, Italy, 2009.
[2] Estefanía Cano, Gerald Schuller, and Christian Dittmar.
Pitch-informed solo and accompaniment separation: towards its use in music education applications. EURASIP Journal on Advances in Signal Processing, 23:1–19, 2014.
[3] Maximo Cobos and Jose J. Lopez. Interactive enhancement of stereo recordings using time-frequency selective panning. In Proceedings of the 40th AES International Conference on Spatial Audio, Tokyo, Japan, 2010.
[4] Brecht De Man and Joshua D. Reiss. A semantic approach to autonomous mixing. Journal of the Art of Record Production, 8, 2013.
[5] Brecht De Man and Joshua D. Reiss. Adaptive control of amplitude distortion effects. In Proceedings of the 53rd AES International Conference on Semantic Audio, London, UK, 2014.
[6] Derry FitzGerald. Harmonic/percussive separation using median filtering. In Proceedings of the 13th International Conference on Digital Audio Effects (DAFx-10), Graz, Austria, 2010.
[7] Marcel Hilsamer and Stefan Herzog. A statistical approach to automated offline dynamic processing in the audio mastering process. In Proceedings of the 17th International Conference on Digital Audio Effects (DAFx-14), Erlangen, Germany, 2014.
[8] ISO International Organization for Standardization. Acoustics – Preferred frequencies, August 1997.
[9] ITU Radiocommunication Bureau. Algorithms to measure audio programme loudness and true-peak audio level (Rec. ITU-R BS.1770-3), August 2012.
[10] ITU Radiocommunication Bureau. General methods for the subjective assessment of sound quality (Rec. ITU-R BS.1284-1), December 2003.
[11] ITU Radiocommunication Bureau. Method for the subjective assessment of intermediate quality level of audio systems (Rec. ITU-R BS.1534-2), June 2014.
[12] Zheng Ma, Joshua D. Reiss, and Dawn A. A. Black. Implementation of an intelligent equalization tool using Yule-Walker for music mixing and mastering. In Proceedings of the 134th AES Convention, Rome, Italy, 2013.
[13] Stylianos-Ioannis Mimilakis, Konstantinos Drossos, Andreas Floros, and Dionysios Katerelos. Automated tonal balance enhancement for audio mastering applications. In Proceedings of the 134th AES Convention, Rome, Italy, 2013.
[14] Enrique Perez Gonzalez. Advanced Automatic Mixing Tools for Music. PhD thesis, Queen Mary University of London, London, UK, 2010.
[15] Pedro D. Pestana and Joshua D. Reiss. A cross-adaptive dynamic spectral panning technique. In Proceedings of the 17th International Conference on Digital Audio Effects (DAFx-14), Erlangen, Germany, 2014.
[16] Jeffrey Scott and Youngmoo E. Kim. Analysis of acoustic features for automated multi-track mixing. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), Miami, USA, 2011.
[17] Jeffrey Scott, Matthew Prockup, Erik M. Schmidt, and Youngmoo E. Kim. Automatic multi-track mixing using linear dynamical systems. In Proceedings of the 8th Sound and Music Computing Conference (SMC), Padova, Italy, 2011.
[18] Priyanka Shekar and Julius O. Smith III. Modeling the harmonic exciter. In Proceedings of the 135th AES Convention, New York, USA, 2013.
[19] Udo Zölzer, editor. DAFX: Digital Audio Effects. John Wiley & Sons Ltd., second edition, 2011.