ITU-T Workshop on "From Speech to Audio: bandwidth extension, binaural perception"
Lannion, France, 10-12 September 2008
International Telecommunication Union

Towards Consistent Assessment of Audio Quality of Systems with Different Available Bandwidth
Yu Jiao, Sławek Zieliński, Francis Rumsey
Institute of Sound Recording, University of Surrey

Outline
• New challenges in quality assessment
• Attributes to assess
• Standard scale
• Summary
• Demo I: an example of attribute identification and assessment
• Demo II: assessing envelopment, an example of direct anchoring

New Challenges in Speech and Audio Quality Assessment
[Diagram: speech on one side, audio on the other; "Where should I go?"]
Is there a need for the development of a new, more universal standard for audio quality assessment, regardless of application or bandwidth? A universal method?

Towards a Universal Method
Two challenges:
• Which perceptual attributes to use?
• How to calibrate the scale so that results obtained from assessments of audio quality in different applications can be sensibly compared?

Which Attributes to Use?
Most commonly used attributes:
• Speech Quality (ACR, DCR, CCR) – ITU-T Recommendations
• Basic Audio Quality (continuous scale), Stereophonic Image Quality, Front Image Quality, Impression of Surround Quality – ITU-R Recommendations
What about other attributes? (see next slide)
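As an aside, the ACR method listed above reduces listeners' category votes to a single Mean Opinion Score by simple averaging. A minimal sketch: the five-point labels and their 5-to-1 numeric mapping follow ITU-T P.800, while the votes below are invented for illustration.

```python
# ITU-T P.800 five-point listening-quality scale used by ACR tests.
ACR_SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}

def mean_opinion_score(votes):
    """Map category votes to their numeric values and average them (the MOS)."""
    scores = [ACR_SCALE[v] for v in votes]
    return sum(scores) / len(scores)

# Hypothetical votes from ten listeners for one speech condition:
votes = ["Good", "Fair", "Good", "Excellent", "Fair",
         "Good", "Poor", "Good", "Fair", "Good"]
print(round(mean_opinion_score(votes), 2))  # prints 3.6
```

Note that the MOS is only as "absolute" as the listeners' use of the category labels, which is exactly the point questioned in the bias slides that follow.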
Which Attributes to Use?
[Diagram: candidate attributes, grouped into preferences and judgements]
• Preferences: Preference, Basic Audio Quality, Timbral Quality, Spatial Quality
• Judgements: Front Image Accuracy, Timbral Accuracy, Spatial Accuracy, Localisation Error, Dynamics, Loudness, Envelopment, Distortions, Noise Presence, Width Change, Distance Change
This set is not exhaustive.

Which Attributes to Use?
• It is relatively easy to agree upon and standardise high-level attributes.
• It is more difficult to standardise low-level attributes.
• The usefulness of low-level attributes is application specific.
• However, a pool of standardised attributes, together with associated anchors and a scale system, may be of help.

Example of a Spatial Attributes Pool – Rumsey (2002)
One of the most systematic attribute pools for spatial audio assessment.

Do We Need a Standard Audio Quality Scale?
• It may help to reduce bias in listening tests: contraction bias, range equalising bias, centring bias, stimulus spacing bias, etc. (Poulton, 1989; Zielinski, Rumsey and Bech, 2008).
• It is essential for the calibration of objective models.
• It may help to compare results across applications with different available bandwidth.

Range Equalising Bias = "Rubber Ruler Effect"
[Figure: listeners stretch their ratings to fill the scale regardless of whether full-band, wide-band or narrow-band stimuli are presented]
Range Equalising Bias – Data taken from Zielinski et al. (2003, 2005)
• Means and 95% confidence intervals
• Systematic upward shift of scores as the stimulus range narrows
• Maximum difference ≈ 20 scale points
• Are the scores absolute?
Conclusion: do not put confidence in the scale labels.

Range Equalising Bias – Zielinski et al. (2007)
• Means and 95% confidence intervals
• Systematic upward shift
• Maximum difference = 13 scale points
• Are the scores absolute?

Range Equalising Bias – Cheer (2008)
Test based on ITU-T P.800 (Excellent / Good / Fair / Poor / Bad).
Questions:
• Is "Absolute Category Rating" (ACR) really absolute?
• Do we need a better calibrated scale?

Towards Consistent Assessment across Narrow-, Wide- and Full-Band Applications
Use a standard scale:
• If possible, use physical units that are familiar to the listeners (Poulton, 1989). For example, the width of the frontal image can be assessed as an angle in degrees, and the distance between listener and apparent source in metres.
• An open-ended ratio scale is biased (Narens, 1996).
• Verbal anchors along the scale are ineffective (see the previous slides).
• Auditory anchors are effective but difficult to implement.

Three Types of Auditory Anchors
1. Direct Anchors
Listeners are instructed how to use the scale relative to two or more auditory anchors.
• Help to define a "frame of reference"
• Examples of partial use: ITU-R BS.1116 and ITU-R BS.1534 (MUSHRA)
2. Indirect Anchors
Anchors are included in the set of stimuli under assessment. Listeners are not instructed how to assess them and are unaware of their purpose.
• An effective bias diagnostic tool
3. Background Anchors
Anchors are presented only during a familiarisation phase prior to the listening test and are not included in the test proper. Listeners are not instructed as to how these anchors relate to the scale.
• May help to define a "frame of reference" if used properly
• They do not calibrate the scale but "calibrate" the listeners
• Examples of use: ITU-T P.800 (MNRU reference quality impairments); the 3.5 kHz anchor in ITU-R BS.1534 (MUSHRA)
• Used very rarely

Example of a Scale with Direct Anchors (A & B) – Conetta (2008)
Only two anchors are used in this example, but more than two can also be used.

Challenges of Direct Anchoring
• How to choose anchors in terms of levels of quality?
• How to make them similar to the stimuli under assessment in terms of perceptual properties?
A hypothetical standard quality scale:
• Anchor 1: full-bandwidth, full surround sound quality?
• Anchor 2: CD quality?
• Anchor 3: FM radio quality?
• Anchor 4: narrow-bandwidth telephone quality?
• Anchor 5: cascaded codecs? drop-outs? modulation noise? hard clipping? etc.

Example of Diagnostics with Indirect Anchors
• Indirect anchors are a useful diagnostic tool to check for bias
• Scores "float" along the scale
• Do not put confidence in the labels

Summary
• There is a need for the development of a new, more universal standard for audio quality assessment, regardless of application or bandwidth.
• More attributes are needed to reveal the nature of quality degradation.
• Comparing audio quality across different applications, e.g. with different audio bandwidths, is problematic due to potentially inconsistent use of the scale (an ill-defined frame of reference).
• A standard scale is needed.
• The direct anchoring technique could be used, but suitable auditory anchors are difficult to identify.

Demo I: An Example of Attribute Identification and Assessment – Jiao et al. (2007)
Starting point: a newly developed codec.
Attribute identification and selection:
• Basic Audio Quality
• Timbral Distortion
• Spatial Distortion
• Dynamic Spatial Distortion (DSD): Level of DSD and Dynamicity of DSD
Followed by design of the scales and test method.

Regression model relating Basic Audio Quality (BAQ) to Timbral Distortion (TD), Level of DSD (LDSD) and Dynamicity of DSD (DDSD):

BAQ = −0.668 × TD − 0.350 × LDSD − 0.179 × DDSD + 86.45

Demo II: Assessing Envelopment, an Example of Direct Anchoring – George et al. (2008)
Interface allowing assessment of envelopment arising from surround recordings.

References
Cheer, J. (2008) The Investigation of Potential Biases in Speech Quality Assessment. Final Year Technical Project, Institute of Sound Recording, University of Surrey, UK.
Conetta, R. (2007) Scaling and Predicting Spatial Attributes of Reproduced Sound Using an Artificial Listener. MPhil/PhD Upgrade Report, Institute of Sound Recording, University of Surrey.
George, S., Zielinski, S., Rumsey, F. and Bech, S. (2008) Evaluating the Sensation of Envelopment Arising from 5-channel Surround Sound Recordings. Presented at the 124th AES Convention, Paper 7298, Amsterdam, The Netherlands.
Jiao, Y., Zielinski, S. and Rumsey, F. (2007) Adaptive Karhunen-Loeve Transform for Multichannel Audio. Presented at the 123rd AES Convention, Paper 7298, New York.
Narens, L. (1996) A Theory of Ratio Magnitude Estimation. Journal of Mathematical Psychology, vol. 40, pp.
109-129.
Poulton, E.C. (1989) Bias in Quantifying Judgments. Lawrence Erlbaum, London.
Rumsey, F. (2002) Spatial Quality Evaluation for Reproduced Sound: Terminology, Meaning, and a Scene-Based Paradigm. J. Audio Eng. Soc., vol. 50, no. 9, pp. 651-666.
Zielinski, S., Rumsey, F. and Bech, S. (2003) Effects of Down-Mix Algorithms on Quality of Surround Audio. J. Audio Eng. Soc., vol. 51, no. 9, pp. 780-798.
Zielinski, S., Rumsey, F., Kassier, R. and Bech, S. (2005) Comparison of Basic Audio Quality and Timbral and Spatial Fidelity Changes Caused by Limitation of Bandwidth and by Down-mix Algorithms in 5.1 Surround Audio Systems. J. Audio Eng. Soc., vol. 53, no. 3, pp. 174-192.
Zielinski, S., Hardisty, P., Hummersone, Ch. and Rumsey, F. (2007) Potential Biases in MUSHRA Listening Tests. Presented at the 123rd AES Convention, Paper 7179, New York, 5-8 October.
Zielinski, S., Rumsey, F. and Bech, S. (2008) On Some Biases Encountered in Modern Audio Quality Listening Tests – A Review. J. Audio Eng. Soc., vol. 56, no. 6, pp. 427-451.

Acknowledgment
Some of this work was carried out with the financial support of the Engineering and Physical Sciences Research Council, UK (Grant EP/C527100/1).