INTERNATIONAL TELECOMMUNICATION UNION
TELECOMMUNICATION STANDARDIZATION SECTOR
STUDY PERIOD 2009-2012

COM 12 – C 13 – E
February 2009
English only
Original: English

Question(s): 8, 13, 15/12

STUDY GROUP 12 – CONTRIBUTION 13

Source: France Telecom
Title: Comments on test vectors generation in ITU-T Recommendation P.564

Abstract

The need for parametric models of voice quality evaluation is undeniable. They are commonly applied in telecommunication systems wherever quality must be measured on live communications but a lack of resources does not allow the use of signal-based methods. To date, few parametric models of quality evaluation exist, and only two of them have been standardized by ITU-T (the E-model [1] and the Call Clarity Index [2]). To evaluate the performance of parametric voice quality evaluation models that have not been standardized, ITU-T has produced Recommendation P.564. Hardly any of the existing models have been validated; among others, the E-model standardized in Recommendation ITU-T G.107 did not pass validation against ITU-T P.564 [3]. We believe that one possible cause, besides the known weaknesses of the E-model, lies in partially improper test vector generation, which leads to MOS scores being estimated with inappropriate values of network parameters. This is caused by the imprecise generation of the packet loss ratio for short test vectors. As the length of test vectors cannot be increased (the PESQ method [4] limits it to 30 s), we propose a revision of the network impairment profile in ITU-T P.564, which is currently based on the network model described in ITU-T G.1050.
Contact: Anna Czuczman
France Telecom
France
Tel: +33 296 05 22 90
Fax: +33 296 05 35 30
Email: anna.czuczman@orange-ftgroup.com

Attention: This is not a publication made available to the public, but an internal ITU-T Document intended only for use by the Member States of ITU, by ITU-T Sector Members and Associates, and their respective staff and collaborators in their ITU related work. It shall not be made available to, and used by, any other persons or entities without the prior written consent of ITU-T.

1. Introduction

Recommendation ITU-T P.564 defines a series of tests that verify whether a candidate parametric model estimates the voice quality of a VoIP service well. The validation of model conformity consists of several phases, which are described by the recommendation:

■ Choice of test vectors – ITU-T P.564 recommends the use of 4 speech files: frf1, spm1, smp2, ukf1
■ Generation of degraded test vectors
■ Evaluation of conformity

Test vectors are generated for various network scenarios, applying several values of the network parameters. Both the network parameters and the scenarios are described in Recommendation ITU-T G.1050. That recommendation also defines the network architecture and the means of simulating network parameters.

Before conducting the actual test vector generation as indicated by ITU-T P.564, pre-tests should be performed. The objective of the pre-tests is to determine parameter values that yield MOS-LQ scores (i.e. we are restricted to a listening-only context) within each of the 5 quality ranges. The evaluation of conformity according to ITU-T P.564 is based on comparing the voice quality estimates of the candidate parametric model (MOS-LQE) with the reference MOS-LQO scores measured by PESQ.
Because the number of tests to be conducted is large, the results of an objective measurement method, rather than of subjective tests, were chosen as the reference MOS scores.

[Figure 1: G.1050 - IP Network Impairment model. Path: source device A, LAN A, local access A, core IP network, local access B, LAN B, destination device B. Impairment parameters shown: LAN data rate and LAN occupancy (LANs); access data rates, access occupancy and MTU size (local accesses); route flapping, link failure, one-way delay, jitter, packet loss and reordered packets (core IP network).]

While conducting validation according to ITU-T P.564 we observed that the MOS scores calculated by PESQ differed significantly (by 1 to 2 points on the MOS scale) for the same set of parameter values (comparison of the results of pre-tests and actual tests). After close investigation we believe that this stems from the fact that, for small files, the generated packet loss ratio differs significantly from the demanded packet loss ratio. As a result, even though the demanded model parameters used to generate the test vectors were exactly identical, the resulting packet loss ratios were not the same, and thus neither were the MOS scores for those files.

The second chapter of this contribution explains the theoretical foundations of this issue. The following chapters describe its implications for Recommendations ITU-T G.1050 (chapter 3) and ITU-T P.564 (chapter 4). The last chapter concludes this document and presents two proposals for approaching the problem.

2. Desired packet loss ratio

The simplest algorithm of packet loss generation is based on one parameter: the packet loss ratio Ppl. Ppl describes the proportion of packets lost in a test vector of N packets. The algorithm is presented hereafter.
Let N be the number of packets
for n = 1:N do
    generate random number p from 0 to 1
    if p < Ppl
        loss(n) = TRUE
    else
        loss(n) = FALSE
    endif
end

The proportion of packets actually lost tends to Ppl if N is large enough (N > 500000, see footnote 1). However, the length of test vectors in ITU-T P.564 validation is quite small. In fact it cannot exceed 30 s, as PESQ (which has been chosen in P.564 as the reference method) does not allow the use of longer files. Therefore the maximum number of packets is 1500 for a packet size of 20 ms. This maximum length of test vectors is much too small, and as a result the observed (applied) packet loss ratio (Ppm) may differ significantly from Ppl. This is shown below.

Theoretically, the probability of losing n packets when applying the above algorithm with a defined Ppl is equal to:

$$P_{\text{lose}}(n \text{ out of } N \text{ packets}) = \binom{N}{n} \, P_{pl}^{\,n} \, (1 - P_{pl})^{N-n}$$

Let the measured probability Ppm be defined as the number of lost packets n divided by the total number of packets in the test vector (N). Using the above formula we can obtain the relation between the measured (Ppm) and the demanded (Ppl) packet loss ratio. Figure 2 illustrates the probability density function of the measured packet loss ratio (Ppm) knowing that the demanded packet loss ratio was equal to Ppl.

[Figure 2: Probability density function of actually obtained packet loss ratio for a given test vector]

Footnote 1: For Ppl = 2 %, there is a probability of 98.76 % that the measured packet loss ratio will be equal to 2 % ± 0.05 %.

As can be observed in Figure 2, the standard deviation of the probability density function is quite significant. For example, for Ppl equal to 2 % there is a non-zero probability that the observed packet loss ratio (Ppm) will be half as large or 50 % higher.

Practical experiments confirm the theoretical assumptions. The packet loss generation algorithm presented above was employed to generate packet loss vectors.
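As a rough illustration of the spread involved, the following Python sketch applies the per-packet algorithm above to many vectors and compares the empirical standard deviation of Ppm with the value that follows from the binomial distribution, sqrt(Ppl (1 − Ppl) / N). It is only a sketch: the function names, the random seed and the number of trials are illustrative assumptions and are not taken from P.564 or G.1050.

```python
import math
import random


def generate_loss_vector(n_packets, p_loss, rng):
    """Apply independent per-packet loss; True marks a lost packet."""
    return [rng.random() < p_loss for _ in range(n_packets)]


def measured_loss_ratio(loss_vector):
    """Ppm: number of lost packets divided by the vector length."""
    return sum(loss_vector) / len(loss_vector)


def theoretical_std(n_packets, p_loss):
    """Standard deviation of Ppm when the loss count is binomial."""
    return math.sqrt(p_loss * (1.0 - p_loss) / n_packets)


def empirical_std(n_packets, p_loss, trials, seed=0):
    """Standard deviation of Ppm observed over repeated generations."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    ratios = [
        measured_loss_ratio(generate_loss_vector(n_packets, p_loss, rng))
        for _ in range(trials)
    ]
    mean = sum(ratios) / trials
    return math.sqrt(sum((r - mean) ** 2 for r in ratios) / trials)


if __name__ == "__main__":
    p_loss = 0.02
    # 400 packets = 8 s, 1500 packets = 30 s at 20 ms per packet
    for n_packets in (400, 1500):
        theory = theoretical_std(n_packets, p_loss)
        observed = empirical_std(n_packets, p_loss, trials=1000)
        print(f"N={n_packets}: theory={theory:.4%}, observed={observed:.4%}")
```

For Ppl = 2 %, the binomial standard deviation is about 0.36 percentage points for N = 1500 and 0.70 percentage points for N = 400, which is large compared with the demanded ratio itself.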
10000 tests were performed for demanded loss ratios (Ppl) of 2 %, 5 % and 10 %.

[Figure 3: Experimentally observed probability density function of packet loss ratio generation]

Therefore, when generating packet loss on files which are too small, one has to be aware that the obtained packet loss ratio can differ from the demanded one. In addition, these effects are visible even for the maximum test vector length allowed by PESQ (30 s).

3. Implication for G.1050

The model defined by ITU-T G.1050 uses a more complicated model of packet loss (Gilbert-Elliott), but the algorithm of loss application is basically the same as the one presented above in this document (loss is applied packet by packet). As the model uses several parameters to generate the loss sequence (loss and transition probabilities), each of these parameters is subject to the effects described in the previous chapter.

In order to simulate a real network, G.1050 takes into consideration several parts of the network (home, access and core). In each part of the network, packet loss and jitter are applied according to mechanisms proper to that part of the network. However, this also means that packet loss is applied several times, in every part of the network, and that each time the packet loss ratio actually applied may differ from the demanded ratio. Therefore the resulting packet loss ratio at the egress of the model may differ significantly from the ratio which could have been calculated theoretically and, more troublesome, between two consecutive applications of the same simulation scenario.

This fact in itself is not a problem when using test vectors of significant length (the theoretical and obtained packet loss ratios are then similar); however, it has important implications when used for validation with P.564.

4. Implications for P.564

Most of the network parameters described by ITU-T G.1050 which are taken into consideration by ITU-T P.564 have a direct influence on the packet loss ratio applied to test vectors: LAN or access occupancy, Out Of Sequence Packets and Core loss percentage. In addition, in several test scenarios (1A, 3C, 5C, 6B, 7B) these parameters are used at the same time.

Furthermore, ITU-T P.564 recommends the use of speech files of 8 seconds. This means that for a typical packet size of 20 ms the test vectors have a length of 400 packets. This is even smaller than the length for which the analysis described in chapter 2 was done.

As described in the previous chapters, when simulating packet loss on small test vectors, the obtained packet loss ratio differs from the demanded ratio. Moreover, if several parameters are used to apply packet loss (V1 vs. V2 etc. as defined by ITU-T P.564), it is difficult to determine the value of each parameter that was actually used to apply packet loss. Generally, the demanded values are considered to be those which were actually applied, but as described in the previous chapter this is not necessarily true. Therefore the obtained MOS scores may in fact correspond to different nominal values of parameters (V1, V2, V3, V4 or F1) than those to which they were attributed.

As a result, it may happen that MOS scores measured during different tests for the same condition differ from each other, even though the measurements were performed on test vectors generated with the same set of network parameters. It is also highly probable that some vectors fall outside the conditions for which they were meant. Overall, however, we cannot say that this introduces a global bias into the test plan of P.564.

We believe that this may be one of the reasons why the E-model was not validated according to P.564.

5. Conclusions

P.564 describes an excellent procedure for validating candidate parametric models of voice quality evaluation.
Yet, because it uses the PESQ method for reference MOS measurement, the test vector length is limited to 30 s, which for standard payloads of 20 ms corresponds to 1500 packets. On the other hand, the ITU-T G.1050 network model used to generate test vectors has an important drawback when small test vectors are used.

While it is clear that no other reference method can replace PESQ in P.564 for the time being, it is necessary to discuss the method of test vector generation. Several possibilities may be considered.

Firstly, if it is decided to continue using the ITU-T G.1050 network model, the procedure of test vector generation needs to be changed slightly. For example, it is possible to generate several test vectors for one set of parameters and then calculate the mean of the observed packet loss ratios of these test vectors. The MOS score (for both PESQ and the parametric model under validation) should then be calculated only for the test vector whose observed packet loss ratio is closest to the mean value. The number of test vectors which need to be generated should be defined empirically.

A second option could consist in using a different network model (or an enhanced G.1050) for test vector generation. This model should employ a different packet loss algorithm which would be less subject to randomness for small test vectors. Nevertheless, the applied model should still simulate real network conditions.

Q 15/12 should consider this issue, as it may be an important cause of non-conformity of candidate parametric models which otherwise could have been proven valid.
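The first proposal above can be sketched as follows. This is a minimal Python sketch under stated assumptions: a single independent per-packet loss process stands in for the full G.1050 model, and the names used (closest_to_mean, select_representative_vector, n_candidates) are hypothetical and not defined by P.564 or G.1050. The number of candidates is left as a parameter since, as noted above, it would have to be determined empirically.

```python
import random


def generate_loss_vector(n_packets, p_loss, rng):
    """Stand-in for the network model: independent per-packet loss."""
    return [rng.random() < p_loss for _ in range(n_packets)]


def loss_ratio(loss_vector):
    """Observed packet loss ratio of one vector."""
    return sum(loss_vector) / len(loss_vector)


def closest_to_mean(ratios):
    """Index of the ratio closest to the mean of all ratios."""
    mean_ratio = sum(ratios) / len(ratios)
    return min(range(len(ratios)), key=lambda i: abs(ratios[i] - mean_ratio))


def select_representative_vector(n_packets, p_loss, n_candidates, seed=0):
    """Generate several vectors for one parameter set and keep the one
    whose observed loss ratio is closest to the mean over all candidates."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    candidates = [
        generate_loss_vector(n_packets, p_loss, rng)
        for _ in range(n_candidates)
    ]
    ratios = [loss_ratio(c) for c in candidates]
    best = closest_to_mean(ratios)
    mean_ratio = sum(ratios) / len(ratios)
    return candidates[best], ratios[best], mean_ratio


if __name__ == "__main__":
    # 400 packets corresponds to an 8 s speech file at 20 ms per packet
    vector, ratio, mean_ratio = select_representative_vector(400, 0.05, 50)
    print(f"selected ratio={ratio:.4f}, mean over candidates={mean_ratio:.4f}")
```

Only the selected vector would then be passed to PESQ and to the parametric model under validation, so that both are evaluated on a loss pattern whose observed ratio is representative of the demanded parameters.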
References

[1] ITU-T Recommendation G.107 (2008), "The E-model: a computational model for use in transmission planning"
[2] ITU-T Recommendation P.562 (2004), "Analysis and interpretation of INMD voice-service measurements"
[3] ITU-T Contribution COM 12 – C 100 – E (2007), "E-model P.564 Compliance Testing and E-model evolution proposals"
[4] ITU-T Recommendation P.862 (2001), "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs"

_____________

ITU-T\COM-T\COM12\C\13E.DOC