File: testset.doc, updated 10/11/90

              TIMIT Suggested Training/Test Subdivision

The texts and speakers in TIMIT have been subdivided into suggested training
and test sets using the following criteria:

    1- Roughly 20 to 30% of the corpus should be used for testing purposes,
       leaving the remaining 70 to 80% for training.

    2- No speaker should appear in both the training and testing portions.

    3- All the dialect regions should be represented in both subsets, with
       at least 1 male and 1 female speaker from each dialect.

    4- The amount of overlap of text material in the two subsets should be
       minimized; if possible no texts should be identical.

    5- All the phonemes should be covered in the test material, preferably
       each phoneme should occur multiple times in different contexts.

NOTE: THIS TEXT SUBDIVISION HAS NO CORRESPONDENCE WITH THE ORIGINAL TRAINING
MATERIAL DISTRIBUTED ON THE PROTOTYPE CD-ROM. Use only the designated
training material on this disc for training purposes.

Core Test Set
-------------

Using the above criteria, 2 male speakers and 1 female speaker from each
dialect were selected, providing a core test set of 24 speakers. Each
speaker read a different set of 5 SX sentences. Since each SI sentence was
read by only one speaker, these texts did not impose constraints in
selecting the texts or speakers. The selected texts were checked to ensure
that the set included at least one occurrence of each phoneme. The phonemic
analysis was based on concatenated phonemic transcriptions of the words in
the sentence, not the actual, realized phonetic transcription. Thus, the
phonetic allophones found in the test data may be expected to differ from
the underlying phonemic forms in accordance with typical phonological
variations.

The core test set thus contains 192 different texts ((5 SX + 3 SI
sentences) x 24 speakers). To avoid overlap with the training material the
2 SA sentences have been excluded from the core and complete test sets.
THESE SENTENCES ARE INCLUDED ON THE CD-ROM, BUT SHOULD NOT BE USED FOR
TRAINING OR TEST PURPOSES.

Table 1 lists the speakers in the core test set for each dialect. This
set is the minimum recommended set for test purposes.

        Table 1: Speakers in the Core Test Set

    Dialect    Male          Female    #Texts/Speaker    #Total Texts
    -------    ----------    ------    --------------    ------------
       1       DAB0, WBT0    ELC0            8                24
       2       TAS1, WEW0    PAS0            8                24
       3       JMP0, LNT0    PKT0            8                24
       4       LLL0, TLS0    JLM0            8                24
       5       BPM0, KLT0    NLP0            8                24
       6       CMJ0, JDH0    MGD0            8                24
       7       GRT0, NJM0    DHC0            8                24
       8       JLN0, PAM0    MLD0            8                24
    -------    ----------    ------                      ------------
     Total         16           8                            192

Complete Test Set
-----------------

The complete test set is obtained by including all of the speakers that
said any of the texts read by any speaker in the core test set. In doing
so, it is ensured that no sentence text appears in both the training and
test material. Thus, since each SX text was read by a total of 7 speakers,
an additional 6 speakers for each text are included, giving a total of 168
speakers. The 168 speakers represent 27% of the total number of speakers.
The resulting dialect distribution of the 168 speaker test set is given in
Table 2.

        Table 2: Dialect Distribution of Speakers in Complete Test Set

    Dialect    #Male    #Female    Total
    -------    -----    -------    -----
       1          7         4        11
       2         18         8        26
       3         23         3        26
       4         16        16        32
       5         17        11        28
       6          8         3        11
       7         15         8        23
       8          8         3        11
    -------    -----    -------    -----
     Total      112        56       168

The complete test set consists of a total of 1344 sentences, 8 sentences
from each of the 168 speakers. In this set there are 120 distinct SX texts
and 504 different SI texts. Thus, roughly 27% (624) of the texts have been
reserved for the test material.

The minimum recommended test material is the core test set, which consists
of 2 male speakers and 1 female speaker from each dialect and 192 unique
texts. Those wishing to perform more extensive testing should use the
complete test set consisting of a total of 1344 sentences from 168
speakers.
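
The totals quoted in this file can be cross-checked with a short script.
The following is a minimal sketch in Python and is not part of the TIMIT
distribution. The speaker IDs and per-dialect counts are copied from
Tables 1 and 2. All names in the script (CORE_TEST_SPEAKERS,
COMPLETE_TEST_DISTRIBUTION, is_test_sentence, and so on) are hypothetical,
and the sentence ID form used by is_test_sentence (a type prefix followed
by a number, such as "SA1" or "SX217") is an assumption.

```python
# Speaker IDs copied from Table 1: dialect -> (male IDs, female IDs).
CORE_TEST_SPEAKERS = {
    1: (["DAB0", "WBT0"], ["ELC0"]),
    2: (["TAS1", "WEW0"], ["PAS0"]),
    3: (["JMP0", "LNT0"], ["PKT0"]),
    4: (["LLL0", "TLS0"], ["JLM0"]),
    5: (["BPM0", "KLT0"], ["NLP0"]),
    6: (["CMJ0", "JDH0"], ["MGD0"]),
    7: (["GRT0", "NJM0"], ["DHC0"]),
    8: (["JLN0", "PAM0"], ["MLD0"]),
}

# Speaker counts copied from Table 2: dialect -> (#male, #female).
COMPLETE_TEST_DISTRIBUTION = {
    1: (7, 4), 2: (18, 8), 3: (23, 3), 4: (16, 16),
    5: (17, 11), 6: (8, 3), 7: (15, 8), 8: (8, 3),
}

SX_PER_SPEAKER = 5  # SX sentences read by each test speaker
SI_PER_SPEAKER = 3  # SI sentences read by each test speaker
TEXTS_PER_SPEAKER = SX_PER_SPEAKER + SI_PER_SPEAKER


def core_test_counts():
    """Return (speakers, texts) for the core test set."""
    speakers = sum(len(m) + len(f) for m, f in CORE_TEST_SPEAKERS.values())
    return speakers, speakers * TEXTS_PER_SPEAKER


def complete_test_counts():
    """Return (males, females, speakers, sentences) for the complete set."""
    males = sum(m for m, _ in COMPLETE_TEST_DISTRIBUTION.values())
    females = sum(f for _, f in COMPLETE_TEST_DISTRIBUTION.values())
    speakers = males + females
    return males, females, speakers, speakers * TEXTS_PER_SPEAKER


def distinct_text_counts():
    """Return (SX texts, SI texts, total) in the complete test set.

    The SX texts are those read by the core speakers (each core speaker
    read a different set); every SI text was read by only one speaker.
    """
    core_speakers, _ = core_test_counts()
    _, _, all_speakers, _ = complete_test_counts()
    sx = core_speakers * SX_PER_SPEAKER
    si = all_speakers * SI_PER_SPEAKER
    return sx, si, sx + si


def is_test_sentence(sentence_id):
    """Hypothetical filter: reject the SA sentences, keep SX and SI."""
    return not sentence_id.upper().startswith("SA")


if __name__ == "__main__":
    print("core:", core_test_counts())
    print("complete:", complete_test_counts())
    print("distinct texts:", distinct_text_counts())
```

Keeping the two tables as plain dictionaries makes it easy to compare the
computed totals against the "Total" rows by eye; a filter such as
is_test_sentence would be applied to sentence IDs before scoring so that
the SA sentences never enter training or test lists.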