HOST LABORATORY ROLE IN THE SELECTION OF THE FUTURE NATO NARROW BAND VOICE CODER

NATO C3 Agency, Den Haag, 2597AK, The Netherlands
Voice@nc3a.info

ABSTRACT

This paper describes the role and responsibilities of the host laboratory in the multi-national test and selection process for the future NATO narrow band voice coder standard. The selection was made from a number of implementations of narrow band voice coders submitted by NATO member nations. The voice coders were installed on a voice processing workstation at the host laboratory in fixed and floating point forms, together with a number of reference coders. The voice coders were then comprehensively tested in a wide range of noise environments and conditions representative of military scenarios. Over 500 hours of processed speech was generated and analysed during these tests.

1. INTRODUCTION

A voice coder for use within NATO will necessarily be used in a wide range of environments, by a wide range of users. Voice coder performance in the presence of background acoustic noise, effective operation in multiple languages, and use by non-native speakers are therefore areas of particular interest for a NATO voice coder.

The process was a multi-national effort. The voice coders submitted for testing were each sponsored by a different NATO nation. The three test laboratories were each based in different countries. The NATO Ad-Hoc Working Group on Narrow Band Voice Coding (AHWG NBVC), which prepared the test plan and monitored the test process, consists of experts from many further nations. The NATO C3 Agency acted as impartial host laboratory for the testing.

In order to make the most accurate selection of voice coder within such disparate working environments, the AHWG NBVC developed a comprehensive test regime [1]. The tests were conducted in a number of languages, including the two official languages of NATO, English and French. The selection process combined tests of speech quality, intelligibility, speaker recognition and language dependency. These attributes were assessed with a variety of different methods used by the test laboratories involved.

Input material was provided to the host laboratory by the test laboratories. The host laboratory then processed the test data through the voice coders and performed any required pre- and post-processing. In addition, the host laboratory applied a blinding process to all processed data to ensure impartial analysis.

2. PROCESSING OF TEST DATA

2.1. Voice Coding Workstation

In order to process data with software versions of the coders, the host laboratory installed a voice processing workstation based on a Sun Ultra 60 computer. The computation required by the state-of-the-art voice coders being tested, coupled with the large amounts of data to be processed, demanded a workstation of substantial processing power. Phase one required executable code only, although some coders were provided as 'C' source code and compiled in-situ. All digital speech input data was provided as standard raw audio files (audio sampled at 8 kHz, 16 bits per sample, 2's complement). The large amounts of data involved (over 35 GB in phase two) necessitated the use of DDS3 tapes for data handling. For reasons of confidentiality, the workstation was not connected to any network at any point during the tests.
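As an illustration of the input data format described above, the following minimal C sketch (not part of the selection software) reads one of the headerless raw audio files, assuming 16-bit 2's complement samples at 8 kHz whose byte order matches the host workstation; the file name is hypothetical.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Hypothetical input file following the Xllmm.ext naming convention. */
    const char *path = "m0102.raw";
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        perror(path);
        return EXIT_FAILURE;
    }

    /* Headerless stream of 16-bit (2's complement) samples at 8 kHz;
       the byte order is assumed to match the host. */
    short sample;
    long nsamples = 0;
    while (fread(&sample, sizeof sample, 1, fp) == 1)
        nsamples++;
    fclose(fp);

    printf("%s: %ld samples (%.2f s at 8 kHz)\n",
           path, nsamples, (double)nsamples / 8000.0);
    return EXIT_SUCCESS;
}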
2.2. Input Test Data

Data was provided to the host laboratory by the test laboratories on DDS3 tape. The directory tree, shown in Figure 1, was used to separate the test material for the different tests. All input data was provided in files with the name and directory location given by:

/Phase/TestLab/testY/noiseZZ/Xllmm.ext

Each test laboratory provided the data for only one directory at the "test lab" level. All lower levels of the tree were populated by each test laboratory, although not all tests were performed by all laboratories and not all twelve noise conditions were required for every test.

Figure 1: Input data directory and file structure (test phase: /phase1, /phase2; test lab: /Arcon, /CELAR, /TNO; test type: /test1 to /test5; noise condition: /noise01 to /noise12; files Xllmm.ext, where X = gender (m/f), ll = speaker identifier, mm = phrase number and ext = data type, .raw (raw audio) or .08K (8 kHz sampled data)).

3. HOST LABORATORY PROCESSING

3.1. Phase One Testing

Phase one of the selection process assessed the performance of the candidate voice coders for both intelligibility and voice quality in benign acoustic conditions. Non real-time floating point C software implementations of the candidate vocoders were used. The candidates operated on digital speech provided as standard raw audio files. The candidate voice encoders read a raw audio file and wrote a packed digital bit stream (to file) at either 1.2 kbps or 2.4 kbps. The voice decoders then read the packed digital bit stream and generated corresponding synthetic output speech (also written to file, in raw audio format) [2].

The use of three national test laboratories, each a specialist in different test methods and languages, ensured that the candidate coders for STANAG 4591 were exposed to a rigorous and varied testing process. In phase one, all coders were assessed in a limited number of noise environments, e.g. quiet, modern office, and non-stationary speech-shaped noise (at 12 dB and 6 dB signal-to-noise ratios). In general, each speech file from the test laboratories was processed through nine voice coders: the three candidate vocoders, each at two data rates, plus three reference coders.

Figure 2: Processing of input data through voice coders (a single raw audio file is encoded to a packed bit stream and decoded by each of the nine coders, LPC10e, CVSD, CELP, FR1200, FR2400, TU1200, TU2400, US1200 and US2400, producing nine raw audio output files which are sent to the test labs for analysis).

For calibration of the Mean Opinion Score (MOS) tests for both the US and NL test labs, the speech material was also passed through Modulated Noise Reference Unit (MNRU) software from the International Telecommunications Union (ITU). This adds known levels of noise to speech signals. For these tests eight MNRU levels were used: 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, 30 dB, 35 dB and 40 dB SNR.

In phase one the voice coders were tested for speech quality using the MOS test, which was conducted by two test laboratories in multiple languages. Phase one intelligibility tests were performed by all three test laboratories, but with different tests: the Consonant-Vowel-Consonant (CVC) test, the Diagnostic Rhyme Test (DRT) and the Intelligence Transmission (IntelTrans) test were used.
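To picture how a single input file flows through the nine encode/decode chains of Figure 2, the sketch below drives hypothetical encoder and decoder executables from C. The executable names and their command-line syntax are invented for illustration only and do not reflect the actual interfaces of the candidate coders.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Illustrative coder names; the real candidates each had their own
       executables and command-line conventions. */
    const char *coders[] = { "lpc10e", "cvsd", "celp",
                             "fr1200", "fr2400", "tu1200",
                             "tu2400", "us1200", "us2400" };
    const char *infile = "m0102.raw";   /* hypothetical input file */
    char cmd[512];

    for (size_t i = 0; i < sizeof coders / sizeof coders[0]; i++) {
        /* Encode: raw audio file -> packed digital bit stream (to file). */
        snprintf(cmd, sizeof cmd, "./%s_enc %s %s.bit",
                 coders[i], infile, coders[i]);
        if (system(cmd) != 0) {
            fprintf(stderr, "encode failed: %s\n", cmd);
            return EXIT_FAILURE;
        }

        /* Decode: packed bit stream -> synthetic speech (raw audio file). */
        snprintf(cmd, sizeof cmd, "./%s_dec %s.bit %s_%s",
                 coders[i], coders[i], coders[i], infile);
        if (system(cmd) != 0) {
            fprintf(stderr, "decode failed: %s\n", cmd);
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}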
Figure 3: Input data and MNRU processing (the nine raw audio output files from the codecs, together with the eight MNRU-processed files, give 17 raw audio output files; the MNRU files are sent to the test labs as references for analysing speech quality).

3.2. Pre- and Post-Processing

The different types of tests employed within the test plan meant that the processing of input and output data varied. Some tests required only short input stimuli. In order to remove any edge effects during the start and end phases of voice coder operation, short input files were concatenated prior to processing. Concatenation was performed using the ITU Software Toolkit (STL2000). After processing the concatenated data through the voice coders, the output file was cut using the astrip command in the Software Toolkit. An additional (dummy) file inserted at the start of the concatenated input was discarded from the stripped output. Pre- and post-processing was required in both phase one and phase two.

3.3. Phase Two Testing

Phase two of the selection process extended the range of acoustic noise environments and the speech characteristics tested. In addition, the phase two candidate coders were restricted to fixed point C source code versions, as would be used in practical implementations in voice communication terminals. In phase two, the coders were once again installed on the NC3A speech processing workstation. However, executable code for all coders was compiled on the workstation from source code. This, coupled with restrictions on the 'C' libraries available during compilation, allowed verification that the coders used only fixed-point operations.

Phase two repeated all of the phase one tests for intelligibility and speech quality, but also performed these tests over a wider range of acoustic noise conditions. These additional acoustic noise conditions were both harsher and more representative of the worst-case NATO operational scenarios. The harsh acoustic noise environments selected were the Mobile Command Enclosure (MCE) field shelter, staff car, wheeled military vehicle (HMMWV and P4), helicopter (UH-60 Black Hawk), tracked vehicle (M2A2 Bradley fighting vehicle and Leclerc tank) and supersonic aircraft (F-15 and Mirage 2000).

Voice coder 'tandem' and channel error conditions are two practical environments in which military voice coders operate. These cases were tested in phase two for both speech quality and intelligibility. The random bit error channel test applies a 1% random bit error pattern to the digital bit stream files. In the case of the 'tandem' condition, the speech passed through two complete voice coding algorithms: first the 16 kbps CVSD algorithm, followed by the candidate algorithm. The two voice coders in the tandem test are complete in that they each take speech in, write digital bit streams to file, and produce output speech.

Figure 4: Phase two additional test configurations (1% bit error rate: a 1% random bit error pattern is applied to the bit stream between coder n and decoder n; voice coder tandem: the audio input passes through a complete CVSD encode/decode stage before being processed by coder n and decoder n).
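The 1% random bit error condition can be pictured with a small sketch: the following C program (a simplified stand-in for the actual test tool, with hypothetical file names) flips each bit of a packed bit-stream file independently with probability 0.01.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Hypothetical file names for illustration only. */
    FILE *in  = fopen("us2400.bit", "rb");
    FILE *out = fopen("us2400_err.bit", "wb");
    if (in == NULL || out == NULL) {
        perror("open");
        return EXIT_FAILURE;
    }

    srand(12345);   /* fixed seed gives a repeatable error pattern */

    int c;
    while ((c = fgetc(in)) != EOF) {
        for (int bit = 0; bit < 8; bit++) {
            /* Flip each bit independently with probability 0.01. */
            if ((double)rand() / ((double)RAND_MAX + 1.0) < 0.01)
                c ^= (1 << bit);
        }
        fputc(c, out);
    }

    fclose(in);
    fclose(out);
    return EXIT_SUCCESS;
}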
Additional types of tests were conducted in phase two. These were speaker recognisability and language dependency tests. Coders were also tested for performance with whispered speech. The whispered speech test uses the Dutch (TNO) Speech Reception Threshold (SRT) intelligibility test for evaluation.

4. BLINDING PROCESS

To mask the identities of all voice coders and guarantee impartiality during the analysis of individual voice coders, a blinding process was applied by the host laboratory to all processed material before it was sent for analysis by the test laboratories.

In phase one a single blind was applied by the host laboratory to the data, prior to evaluation by the test laboratories. In phase two, for added integrity, a double blind was carried out: the NC3A host laboratory performed the initial blinding, with a second blinding operation carried out by an impartial member of the NBVC AHWG. The second blinding was carried out in isolation and provided a static re-labelling of all 'coder n' to 'vocoder m'. The result is that neither the NC3A personnel nor the impartial NBVC representative responsible for the second blinding operation is aware of the identities of the nine coders.

Figure 5: Double blinding process (the nine decoded output files, LPC10e, CVSD, CELP, FR1200, FR2400, TU1200, TU2400, US1200 and US2400, are first blinded by NC3A into 'coder 1' to 'coder 9', then blinded a second time by DSTL into 'vocoder 1' to 'vocoder 9' before being sent to the test lab).

4.1. Output File Structure

All processed files were stored in a directory structure similar to that of the input material. The only difference was the addition of an extra level in the directory tree, used to separate the output from each of the voice coders, as shown in Figure 6. The output from each of the nine coders (three candidates, each at two bit rates, plus three reference coders) is randomly re-labelled as 'coder n', where n = 1 to 9. The output material from each of the three 'test lab' directories is then sent to the respective test laboratory for appropriate evaluation. It should be noted that the test laboratories had no information pertaining to the identity of the voice coding algorithms which they evaluated.

Figure 6: Output directory structure (as for the input tree, with an additional level /coder1 to /coder9 below the noise condition, each containing the output files Xllmm.out).

5. CONCLUSIONS

In this paper we have outlined the role of the host laboratory during the extensive test and evaluation process to select the new NATO narrow band voice coder. This voice coder is being standardised as NATO Standardisation Agreement (STANAG) 4591 [3].

6. ACKNOWLEDGEMENTS

The author would like to thank all those involved in this work, at the test laboratories, the codec developers and sponsors, and all members of the NATO Ad Hoc Working Group on Narrow Band Voice Coding for their assistance during this work.

7. REFERENCES

1. Tardelli, J. et al., "Test and selection plan 1200/2400 bps speech coder", NATO AHWG NBVC, May 2000.
2. "Future NATO Narrow Band Voice Coder Selection: STANAG 4591", NC3A Technical Note 881, Den Haag, The Netherlands, 2002.
3. NC3A web site, http://nc3a.info/Voice.