A preliminary study on the subjective evaluation of the recording quality of motion cameras

Chen Xiayu
Address: School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China, 200240
e-mail: c2g1816525966@sjtu.edu.cn

Jin Yumeng* (Co-first author, contributed equally)
Address: School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China, 200240
e-mail: 20030708@sjtu.edu.cn

Huang Yu* (Corresponding author)
Address: Institute of Vibration, Shock and Noise, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China, 200240
e-mail: yu_huang@sjtu.edu.cn

The demand for high sound quality in motion-camera recordings is increasing with the rapid development of we-media. Sound recorded by two devices with similar electroacoustic indicators (e.g., distortion, dynamic range, frequency response) may nevertheless be perceived very differently by listeners. This study therefore investigated the influence of various psychological factors on the subjective preference for audio recordings acquired by two sports cameras. Fifty subjects compared 14 pairs of audio recordings (six pairs of instrumental music, five pairs of vocal sounds and three pairs of environmental sounds) using the paired comparison method. They also rated all 28 recordings, presented in balanced random order, on category rating scales for multiple dimensions: timbre (dark/light, warm/cold), spatial impression, stereo impression (location accuracy, stability), sound balance, high transparency, medium transparency and low transparency. Subjective preference differed significantly between the two devices. The dimensions of timbre, location accuracy, sound balance and medium transparency significantly affected the subjective preference. Nonlinear curve-fitting models relating preference to the multi-dimension scores were formulated for three groups of stimuli according to their types (i.e., instrumental, vocal and environmental sounds).

Keywords: recording quality, audio quality subjective evaluation, preference, multi-dimension

The 28th International Congress on Sound and Vibration (ICSV28), 24-28 July 2022

1. Introduction
The high penetration rate of the Internet has led to the vigorous development of self-media (we-media).[1] The demand for high sound quality in the recordings of motion cameras, which are important equipment for we-media, is increasing.[2] The sound quality of recording and playback is determined by a series of electromechanical parameters, including harmonic distortion, frequency response, transient distortion, phase distortion and group delay.[3] In addition, various descriptors have been proposed and investigated to characterize and evaluate the subjective impression of sound.[4] Studies have found relationships between objective parameters and subjective indices, e.g., between clarity and the frequency characteristics of the loudspeaker, between the sense of space and transient characteristics, and between softness and both transient and frequency characteristics.[5] Although the technologies for objective measurement and subjective evaluation are now relatively mature, most of them target loudspeakers[6]. For motion cameras, experience suggests that the perceived sound quality of two devices may differ significantly even if their primary objective parameters are similar.

Here we conducted an experiment on the sound quality of several types of sounds recorded by two motion cameras with similar objective parameters. The subjective ratings on multiple dimensions and the subjective preference were investigated to identify the factors that have a significant impact on subjective preference. Tentative nonlinear models were formulated to estimate the preference for the sound quality of the devices from the scores of the various dimensions.

2. Methods
2.1 Apparatus
The experiment was performed in a semi-anechoic chamber.
The apparatus used for the listening experiment consisted of a computer (ThinkPad T450s), a digital-to-analog converter (RME ADI-2 DAC FS) and a pair of headphones (Sennheiser HD600). We used foobar2000 to control the playback duration, playback interval and playback order of the stimuli. The computer volume was set to 100 and the gain of the DAC to –20 dB.

2.2 Stimuli
We used fourteen audio recordings as test stimuli: six of instrumental music, five of vocal sounds and three of environmental sounds. The environmental sounds were recorded in the field with two sports cameras (i.e., devices A and B). The other sounds were purchased from a high-quality music website (Qobuz, www.qobuz.com), played back through a loudspeaker in the anechoic chamber and recorded by devices A and B. All test stimuli were acquired at a bit depth of 24 bit and a sample rate of 96 kHz. The average sound pressure level of each stimulus was set to 60 dBA by adjusting the amplitude of the recordings and calibrating via the headphones on a dummy head (Head Acoustics HMS IV). Figure 1 shows the calibration setup.

Figure 1: The experimental setup and the dummy head used for calibration.

2.3 Participants
Fifty healthy participants (33 males and 17 females, aged 18 to 28 years) attended the experiment. They were all students with no symptoms of ear disease, no blockage of the ear canal, no history of excessive noise exposure, no use of ototoxic drugs and no family history of hearing disease. All participants signed an informed consent form before the test.

2.4 Protocol
The subjective experiment consisted of four parts, i.e., Parts Ⅰ–Ⅳ, as shown in Table 1. In Parts Ⅰ–Ⅲ, participants rated each dimension on a rating scale, with different dimensions adopted in different parts of the experiment.
The questions for the multiple dimensions were formulated according to the ITU recommendation[4]:
Timbre (two dimensions): characterizes the tonal character of the recorded sound, such as dark or light, warm or cold.
Stereo impression (two dimensions): characterizes whether each sound source, with the original stimuli as a reference, can be accurately localized during playback, and whether the localization remains stable throughout.
Spatial impression (one dimension): with the original stimuli as a reference, represents whether the sound image of the recording is spatially appropriate, i.e., consistent with the imagined size of the space.
Sound balance (one dimension): characterizes whether the sounds emitted by the individual sound sources are well balanced and harmonious in the overall recorded sound.
Transparency (three dimensions): characterizes the degree of resolution of the recorded sound at different frequencies, such as whether the bass has impact, whether the midrange is bright and clear, and whether the treble is loud.

Table 1: Evaluation material and evaluation dimensions

Part | Stimuli index | Type of stimuli | Evaluation dimensions
Part Ⅰ (instrument) | A01, 02, 06, 09–11; B01, 02, 06, 09–11 | Percussion instrument; jazz (brass instrument); orchestral symphony; guitar pop with the audience; string quartet; solo piano | Timbre (dark/light); timbre (warm/cold); stereo impression (location accuracy); stereo impression (stability); spatial impression; sound balance; high transparency; medium transparency; low transparency
Part Ⅱ (vocal) | A03–05, 07; B03–05, 07 | Pure vocal (male); chorus (male and female); oratorio (female); pure vocal (female) | Timbre (dark/light); timbre (warm/cold); spatial impression; medium transparency
Part Ⅲ (environment) | A12–14; B12–14 | Lighting; traffic; park | Stereo impression (location accuracy); stereo impression (stability); sound balance
All stimuli | A01–14; B01–14 | All | Subjective preference

The pairwise comparison method[7] was adopted in Part Ⅳ to compare the preference between the stimuli of devices A and B. Each participant listened to each pair of stimuli and reported which one they preferred. All pairs of stimuli were played twice in random order.

2.5 Data analysis
In a pairwise comparison test, if a participant prefers A to B, A scores 2 and B scores 0, and vice versa; if A and B are judged equal, both score 1. The subjective preference for each stimulus is the sum of the scores of the fifty participants. The statistical analyses were carried out in IBM SPSS Statistics (version 25). Shapiro–Wilk tests rejected the assumption of normality for the data sets (p<0.05), so we employed nonparametric statistics for the data analysis[8]. The Wilcoxon test was used for post-hoc comparisons of preference and of each dimension score between two specific groups of stimuli. The significance level (i.e., the threshold for the p-value) was adjusted for multiple comparisons by Bonferroni correction.

3. Results and discussion
3.1 Preference and multiple-dimension scores
Figure 2 shows the sum of all 50 participants' preference scores. The scores of device A are significantly higher than those of device B (p<0.01, Wilcoxon test). Figures 3, 4 and 5 show the results of the multiple dimensions for each stimulus of Parts Ⅰ, Ⅱ and Ⅲ, respectively.

Figure 2: Subjective preference scores of the two devices. ** p<0.01 with Wilcoxon test.

For instrumental music in Part Ⅰ, stimulus A01 performed better in stereo impression (stability) than B01 (p=0.016, Wilcoxon test). For vocal sounds in Part Ⅱ, a significant difference in spatial impression was found between A03 and B03 (pure male voice, windless conditions, anechoic chamber; p<0.001, Wilcoxon test). A03 also tended to perform better than B03 in timbre (dark/light), but the difference was not significant at the 0.05 level (p=0.060, Wilcoxon test). For environmental sounds in Part Ⅲ, a significant difference was found in location accuracy between A13 and B13 (traffic flow, windy conditions, outdoor; p<0.05, Wilcoxon test) and in sound balance between A14 and B14 (park, windy conditions, outdoor; p<0.05, Wilcoxon test). No significant difference was found for any other dimension between the stimuli of devices A and B (p>0.05, Wilcoxon test).

Figure 3: Boxplots (median and interquartile ranges) for each pair of stimuli at each dimension, Part Ⅰ (instrumental music). * p<0.05 with Wilcoxon test.

Figure 4: Boxplots (median and interquartile ranges) for each pair of stimuli at each dimension, Part Ⅱ (vocal). *** p<0.001 with Wilcoxon test.

Figure 5: Boxplots (median and interquartile ranges) for each pair of stimuli at each dimension, Part Ⅲ (environmental sounds). * p<0.05 with Wilcoxon test.

3.2 The preference model
From the above results, it seems that the rating scores in most dimensions do not differ much.
Therefore, the significant difference in preference might be attributed to a comprehensive mechanism that is only partly reflected in the multiple-dimension ratings. As a preliminary step, we formulated nonlinear regression models that account for the relationship between preference and the multiple-dimension scores for each of the three groups of stimuli (i.e., six pairs of instrumental music, five pairs of vocal sounds and three pairs of environmental sounds). The modelling procedure followed the development of Zwicker's psychoacoustic annoyance model[9][10]. The form of the model is:

$$ y = N\left(1 + \sqrt{\gamma_0 + \gamma_1 x_1^2 + \gamma_2 x_2^2 + \cdots + \gamma_n x_n^2}\right), \tag{1} $$

where the dependent variable, $y$, is the preference score, the independent variables, $x_1$–$x_n$, are the scores of each dimension, and $N$ and $\gamma_0$–$\gamma_n$ are coefficients. The preference scores are normalized using Min–Max scaling[11][12]:

$$ \text{Normalized Preference} = \frac{\mathit{preference} - \mathit{preference}_{\min}}{\mathit{preference}_{\max} - \mathit{preference}_{\min}}. \tag{2} $$

The range of preference is 0 to 1 after normalization. For each group of stimuli, we set the scores of the dimensions dark/light, warm/cold, location accuracy, stability, spatial impression, sound balance, high transparency, medium transparency and low transparency to $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, $x_6$, $x_7$, $x_8$ and $x_9$, respectively. Then we can use Eq. (3) to get the coefficients, $\gamma_1$–$\gamma_9$, for each dimension:

$$ y = k x^{\gamma}. \tag{3} $$

Finally, we formulated the models by nonlinear curve fitting according to Eq. (1). The equations are shown below; note that only the five dimensions with relatively high correlation with preference were selected for Part Ⅰ, i.e., the instrumental music.

$$ y = 0.00292\left(1 + \sqrt{-9577 + 6.17x_3^2 - 3.12x_4^2 + 2.33x_5^2 - 2.94x_6^2 + 2.96x_9^2}\right) \tag{4} $$

$$ y = 0.00302\left(1 + \sqrt{31.6 + 1.68x_1^2 - 1.10x_2^2 + 1.17x_5^2 + 1.01x_8^2}\right) \tag{5} $$

$$ y = 0.00352\left(1 + \sqrt{-19320 + 2.86x_3^2 + 2.86x_4^2 + 4.42x_6^2}\right) \tag{6} $$

The goodness of fit of Eqs. (4), (5) and (6) is 0.2677, 0.1406 and 0.7621, respectively.
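As an illustration only, the scoring rule of Section 2.5, the normalization of Eq. (2) and the model form of Eq. (1) can be sketched in a few lines of Python. This is a minimal sketch and not the SPSS or curve-fitting procedure used in the study: the function names, the vote labels ("A", "B", "tie") and all input values are hypothetical, and only the coefficients passed to the model are taken from Eq. (5).

```python
import math


def preference_totals(votes):
    """Sum paired-comparison scores for devices A and B (rule of Section 2.5).

    `votes` holds one entry per judgement: "A", "B" or "tie" (hypothetical labels).
    The preferred stimulus scores 2 and the other 0; a tie scores 1 for each.
    """
    points = {"A": (2, 0), "B": (0, 2), "tie": (1, 1)}
    total_a = total_b = 0
    for vote in votes:
        if vote not in points:
            raise ValueError(f"unknown vote: {vote!r}")
        add_a, add_b = points[vote]
        total_a += add_a
        total_b += add_b
    return total_a, total_b


def min_max_normalize(values):
    """Eq. (2): rescale preference scores linearly to the range 0 to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are equal; Eq. (2) is undefined")
    return [(v - lowest) / (highest - lowest) for v in values]


def preference_model(n_coef, gamma0, gammas, scores):
    """Eq. (1): y = N * (1 + sqrt(gamma0 + sum_i gamma_i * x_i**2))."""
    if len(gammas) != len(scores):
        raise ValueError("need exactly one coefficient per dimension score")
    radicand = gamma0 + sum(g * x * x for g, x in zip(gammas, scores))
    if radicand < 0:
        raise ValueError("negative radicand: Eq. (1) is undefined for these scores")
    return n_coef * (1.0 + math.sqrt(radicand))


# Hypothetical votes of 50 participants for one pair of stimuli.
votes = ["A"] * 30 + ["B"] * 15 + ["tie"] * 5
print(preference_totals(votes))

# Hypothetical preference totals of three stimuli, normalized with Eq. (2).
print(min_max_normalize([65, 35, 50]))

# Coefficients of Eq. (5) (vocal sounds); the scores x1, x2, x5, x8 are made up.
print(preference_model(0.00302, 31.6, [1.68, -1.10, 1.17, 1.01], [3, 3, 3, 3]))
```

Because the constant term is strongly negative in Eqs. (4) and (6), the radicand can become negative for small dimension scores; the sketch raises an error in that case rather than returning a value.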
The goodness of fit for groups 1 and 2 (instrumental music and vocal sounds) is relatively low, probably because the scores depend strongly on the individual sound, whereas the data of all stimuli within a group were pooled for fitting.

4. Conclusion
There are significant differences in subjective preference between two devices with similar objective parameters. The dimensions of timbre, location accuracy, sound balance and medium transparency significantly affected the subjective preference. The influence of the multi-dimension scores also depends on the sound. The nonlinear regression models account for the relation between preference and the multi-dimension scores for instrumental music, vocal sounds and environmental sounds. The model for environmental sounds estimates preference with a relatively high goodness of fit (0.7621). Traditional curve-fitting methods are limited; future work will consider machine learning for modelling preference more accurately from multi-dimension auditory quality.

5. References
[1]. [Online] DIGITAL 2022: ANOTHER YEAR OF BUMPER GROWTH - We Are Social UK
[2]. P. Vannini, L. M. Stewart. The GoPro gaze. Cultural Geographies, 2017, 24(1).
[3]. F. E. Toole. Loudspeaker measurements and their relationship to listener preference. Journal of The Audio Engineering Society, 1974, 22: 402–415.
[4]. Recommendation ITU-R BS.1284-2, General methods for the subjective assessment of sound quality. Geneva: Electronic Publication, 2019.
[5]. A. Furmann, H. Edward, M. Niewiarowicz, P. Perz. On the correlation between the subjective evaluation of sound and the objective evaluation of acoustic parameters for a selected source. The Journal of The Acoustical Society of America, 1990, 38(11): 837–844.
[6]. D. Clark. Precision measurement of loudspeaker parameters. Journal of The Audio Engineering Society, 1996, 100: 1777–1786.
[7]. H. A. David. The Method of Paired Comparisons (2nd Edition). Oxford University Press, New York, NY, USA, 1999.
[8]. S. Siegel, N. J. Castellan. Nonparametric Statistics for the Behavioural Sciences (2nd Edition). McGraw-Hill Higher Education, New York, NY, USA, 1988.
[9]. E. Zwicker, H. Fastl. Psychoacoustics: Facts and Models (2nd Edition). Springer, Berlin, 2007.
[10]. S. More. Aircraft noise metrics and characteristics. Dissertation, Purdue University, West Lafayette, IN, USA, 2011.
[11]. F. A. Wichmann, N. J. Hill. The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 2001, 63: 1293–1313.
[12]. [Online] MinMaxScaling: Min-max scaling for pandas DataFrames and NumPy arrays - mlxtend (rasbt.github.io)