version 1.0 July 98
Tom Brøndsted, tb@cpk.auc.dk
Center for PersonKommunikation, Aalborg University, 1998

A SPE based distinctive feature composition of the CMU Label Set in the TIMIT database

IR 98-1001

Content:
0. List of Tables
1. Introduction
2. Levels of descriptions: phonetics, phonemics, underlying form, surface form.
3. Feature Classes
3.1. Major Class Features
3.2. Vowels
3.2.1. Tense
3.2.2. Front
3.2.3. The Vowel Composition Table
3.3. Consonants
3.3.1. The Consonantal Composition Table
4. References

0. List of Tables

Table 1: English Feature Classes in SPE and in the present report
Table 2: Major Class Features
Table 3: Polyphonematic replacements
Table 4: Openness
Table 5: Place of articulation
Table 6: Vocal allophone-phoneme replacements
Table 7: Distinctive Feature Composition of TIMIT Vowel Segments
Table 8: Monophonematic replacements
Table 9: Consonantal allophone-phoneme replacements
Table 10: Distinctive Feature Composition of TIMIT Consonantal Segments

1. Introduction

The background of this report is a number of speech research activities at the Center for PersonKommunikation, Aalborg University, which attempt to start from phonological features rather than from phonemes or segments derived from phonemes (diphones, triphones, generalised triphones) as in traditional speech technology.
The report provides a phonological feature analysis of the CMU label set in the TIMIT database (see TIMIT 1993) based on the standard generative theory of Chomsky & Halle, the so-called “SPE” theory (see SPE 1968). For a number of reasons, such an analysis is not merely a matter of identifying each TIMIT label with a corresponding segment used in the feature composition of English by Chomsky & Halle. These reasons are described in Section 2.

This report consists of two main sections: Section 2 discusses some theoretical and practical issues, and Section 3 contains the “ready-to-use” composition tables and replacement rules. The present report cannot aim at giving a general introduction to (generative) phonology. Readers unfamiliar with expressions like “surface structure”, “deep structure”, “underlying representation”, “morphophonemic”, “derivation rule” etc. are referred to E. Fischer-Jørgensen (TRENDS 1975) or similar books. Alternatively, such readers may proceed directly to the results presented in Section 3, Table 7 and Table 10.

Note that if the composition tables suggested here are used for some kind of automatic processing of the TIMIT database (acoustic analysis, training of acoustic models or neural networks, acoustic recognition etc.), the TIMIT label files (the transcription files with the extension *.phn and/or the lexicon file timitdic.txt) must be modified as described by the “replacement rules” in Section 3, Table 3, Table 6, Table 8, and Table 9.

The distinctive feature composition presented in this report is, of course, not the only possible solution to the problem of analysing the TIMIT/CMU label set into phonological features. The starting point of our composition can be described as follows: We wanted to use SPE as a generally accepted standard theory of phonology and with as few modifications as possible.
Most notably, we have tried to utilise the Chomsky & Halle decomposition of English segments (SPE 1968, p. 176 f.) as directly as possible. Finally, we have attempted to make as few changes to the TIMIT/CMU label set as possible. Hence, our starting point can be paraphrased as an attempt to “merge” TIMIT with SPE.

Of course, we could have attempted to merge TIMIT with phonological theories other than SPE. The older theory of distinctive features introduced by R. Jakobson (FUNDAMENTALS 1956) may be interesting as well, as its feature classes are defined mostly auditorily (as opposed to the rather articulatory definitions in SPE). Provided that we expect feature classes to have some kind of direct counterparts on the acoustic level, an auditory approach may be more interesting than an articulatory one, since in this case it seems more important to concentrate on how speech units sound than on how they are produced. On the other hand, the auditory approach of R. Jakobson is far more abstract (closer to “form”, farther from “substance”) than SPE. For instance, R. Jakobson sets up a total of twelve universal inherent feature classes, whereas SPE uses twenty-two. This means that many feature classes are used for separating equivocal pairs of sound classes (e.g. “compact/diffuse” separates open vowels from narrow ones as well as back consonants from front ones), and Jakobson’s phonetic descriptions of the separate feature classes are not always transparent (cf. TRENDS 1975 p. 156).

Newer phonological theories can also be taken into consideration. However, as these theories rarely contribute directly to the theory of distinctive features, most of them are irrelevant to the task defined in the present report.
Some “non-linear” phonological theories advocate a hierarchical description of feature bundles based on examination of phonological phenomena (mainly alternations but also articulatory constraints, cf. CLEMENTS 1985). These theories are opposed to SPE, where features are organised in 1-dimensional sets. However, the idea that an inventory of expression segments can be described in terms of a hierarchical tree structure, where upper nodes represent major class features (like +/- vocalic, +/- consonantal), lower nodes represent cavity features, manner of articulation etc., and terminal nodes are phonemes, is not specific to non-linear phonology. We consider the contrary view a misunderstanding by Bitar & Espy-Wilson (BITAR 1995 p. 311). Note that the decomposition tables in Table 7 and Table 10 can “mechanically” be represented as a hierarchical tree when nodes are built top-down using the feature classes in the order given in the left column.

2. Levels of descriptions: phonetics, phonemics, underlying form, surface form.

The TIMIT database consists mainly of a number of speech waveform files (*.wav), each of which is associated with an orthographic transcription (*.txt), a time-aligned word transcription (*.wrd), and a time-aligned phonetic transcription (*.phn). All words in the time-aligned word transcription files are listed in the lexicon (timitdic.txt). Each entry in the lexicon consists of an orthographic representation, in case of phonetic ambiguity a syntactic classification (e.g. “live”, which depending on word class is pronounced either [laɪv] or [lɪv]), and a phonetic/phonemic transcription. Both the phonetic *.phn transcription files and the transcriptions of the lexicon are based on the set of CMU labels, each of which is a simple ASCII representation of an IPA symbol (possibly augmented with diacritic markers), like CMU /NG/ = IPA [ŋ] or CMU /ENG/ = IPA [ŋ̩]. However, there is no 1:1 relation between the transcriptions of the lexicon and the *.phn transcription files.
The transcription files represent a narrow phonetic level where label decisions have been “based on careful listening of portions of the speech waveform, as well as visual examination using displays such as the spectrogram and the original waveform” (TIMIT p. 40). The lexicon transcriptions represent a kind of phonemic level where labels are taken from a smaller set of symbols (denoting “phonemes”; we use the term “phonemics” in the Post-Bloomfieldian sense). In concrete terms, there are the following differences between lexical transcriptions and speech transcriptions:

• The symbols of the lexical transcriptions are taken from a smaller set of 48 symbols in total (46 “phonemes” and two “prosodemes”), whereas the speech transcriptions also include two nonspeech symbols (pause, epenthetic silence), six symbols denoting the closure phase of plosives, and six allophones DX, NX, Q, HV, AX-H, UX denoting flapped variants of /D/ and /N/, a glottal variant of /T/, a voiced variant of /H/, a devoiced schwa vowel, and a fronted /U/ variant, respectively.
• In the speech transcription files, words are sometimes transcribed with varying phonemic references that take individual or regional vowel variability into account, whereas the lexicon transcriptions must be considered more “neutral” and independent of individual and regional deviations (within the USA). For instance, the lexicon transcribes words like “for” and “more” as /f ao1 r/, /m ao1 r/ (IPA [ ], [ ]), whereas the speech transcriptions also have /f ow1 r/, /m ow1 r/ (IPA [ ], [ ], with the same vowel quality as in e.g. “low”).

As the present report deals with a phonemic/phonological level rather than with a phonetic one, we consider it our main task to set up an SPE-based decomposition table for the 46 phonemes used in the lexicon transcriptions.
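The extra symbols of the speech transcriptions listed above (closure symbols and allophones) can be folded back onto the lexicon inventory mechanically. The following Python sketch is only an illustration under stated assumptions: it assumes that each line of a *.phn file has the form "start-sample end-sample label" with lower-case labels; the function names `parse_phn` and `to_phonemic` are hypothetical; and the treatment of closures that are not followed by a matching release, as well as of closures before the affricates CH and JH, is an assumption of the sketch and not something specified in this report. Nonspeech symbols are left untouched.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[int, int, str]  # (start_sample, end_sample, label)

# Allophone -> phoneme. The voiced /H/ variant HV is mapped to "hh", the
# label used for /H/ in the TIMIT lexicon.
ALLOPHONES = {"dx": "d", "nx": "n", "q": "t", "hv": "hh", "ax-h": "ax", "ux": "uw"}

# Closure label -> plosive whose closure phase it denotes.
CLOSURES = {"pcl": "p", "tcl": "t", "kcl": "k", "bcl": "b", "dcl": "d", "gcl": "g"}

# Assumption (not from the report): closures before affricates merge into them.
AFFRICATES = {"tcl": "ch", "dcl": "jh"}


def parse_phn(lines: Iterable[str]) -> List[Segment]:
    """Read '<start> <end> <label>' lines; lines without three fields are skipped."""
    segments: List[Segment] = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            segments.append((int(fields[0]), int(fields[1]), fields[2]))
    return segments


def to_phonemic(segments: List[Segment]) -> List[Segment]:
    """Merge closure + release pairs and replace allophones by phonemes."""
    out: List[Segment] = []
    i = 0
    while i < len(segments):
        start, end, label = segments[i]
        if label in CLOSURES:
            nxt = segments[i + 1][2] if i + 1 < len(segments) else None
            if nxt is not None and nxt in (CLOSURES[label], AFFRICATES.get(label)):
                # Closure followed by its release: one segment spanning both.
                end = segments[i + 1][1]
                label = nxt
                i += 1
            else:
                # Assumption: an unreleased closure counts as the plosive itself.
                label = CLOSURES[label]
        else:
            label = ALLOPHONES.get(label, label)
        out.append((start, end, label))
        i += 1
    return out


# The sample numbers below are invented for illustration.
example = parse_phn(["0 100 h#", "100 200 hv", "200 300 ae", "300 350 dcl", "350 400 d"])
print(to_phonemic(example))
```

Keeping the time stamps in the merged segments means that the result can still be aligned with the waveform, which matters if the phonemic labels are used for training acoustic models.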
The speech transcriptions can be transformed into phonemic representations using the very simple replacement rules described in Section 3.

From a traditional phonemic point of view, there are only two less convenient or “surprising” details in the TIMIT inventory of phonemes used in the lexicon. The inventory consists of 28 consonants and 18 vowels, and considering the variations in the phonemic descriptions of English found in the literature (notably variations in the descriptions of affricates, diphthongs, vocalised /r/, and diphthongs resulting from vocalised /r/, especially in British English), we only find it remarkable that

• syllabic nasals and liquids are considered independent phonemes (normally they are interpreted phonemically as combinations of schwa + nasal/liquid, e.g. /ə/ + /m/, /ə/ + /n/ etc.).
• the inventory includes three schwa vowels occurring only in unstressed syllables (normally the inventory includes only one such vowel).

We consider these details a consequence of the fact that the phonemics of TIMIT represent a phonological direction very close to “substance” rather than a highly abstract and “formal” theory (we refer to the classical “form-substance” discussion in phonology). However, we do not think that TIMIT deals with pure phonetics. In terms of phonological “schools”, we may relate TIMIT to the post-Bloomfieldian tradition rather than to the classical European approaches (Prague, Copenhagen).

From a generative phonological point of view (which in fact is the starting point of SPE), there is another aberrant aspect of the TIMIT inventory of phonemes. SPE describes only segments used in underlying representations, which are subject to phonological derivation rules. The TIMIT inventory of phonemes and the phonemic transcriptions in the lexicon rather correspond to a surface level, where all phonological rules have been applied. Therefore, it is not always possible to relate TIMIT symbols to the segments set up for English in SPE.
Further, the general SPE features have to be modified to describe the more varied set of TIMIT phonemes.

3. Feature Classes

SPE sets up a total of twenty-two feature classes, which according to the standard theory are sufficient for analysing the expression segments (“phonemes”) of any language into distinctive oppositions. For a distinctive feature composition of the segments of a specific language, not all twenty-two feature classes are utilised. For instance, the SPE description of English segments (SPE 1968 p. 176 f.) refers to only thirteen feature classes. The remaining nine classes may be regarded as redundant or “irrelevant” to English.

In general, we have tried to preserve the original thirteen distinctive features used in SPE for the composition of English segments. However, for reasons discussed below, we have made some changes. The changes are described in the table below.

SPE            HERE
VOCALIC        SONORANT
               SYLLABIC
CONSONANTAL    CONSONANTAL
HIGH           HIGH
BACK           BACK
LOW            LOW
               FRONT
ANTERIOR       ANTERIOR
CORONAL        CORONAL
ROUND          ROUND
VOICE          VOICE
TENSE          (TENSE)
CONTINUANT     CONTINUANT
NASAL          NASAL
STRIDENT       STRIDENT
Table 1: English Feature Classes in SPE and in the present report

In short, we have replaced the feature vocalic with sonorant and syllabic, added a vocalic feature front, and removed tense.

3.1. Major Class Features

We prefer the feature syllabic to vocalic (for discussion, cf. TRENDS p. 226 ff.). The main reason is the use of syllabic nasals and laterals in the TIMIT label set, /EM EN ENG EL/. This leads to five major classes of segments:

                                                SONORANT  SYLLABIC  CONSONANTAL
VOWELS                                             +         +          -
GLIDES /W Y HH/                                    +         -          -
SYLLABIC LIQUIDS AND NASALS /EL EM EN ENG/         +         +          +
NON-SYLLABIC LIQUIDS AND NASALS /R L M N NG/       +         -          +
OBSTRUENTS                                         -         -          +
Table 2: Major Class Features

The main categories differ from those of the TIMIT description (p. 30; note the printing error, the missing label "nasal"), where liquids are put together with glides instead of with nasals, and where stops, affricates, and fricatives are considered separate main groups.

3.2. Vowels

The TIMIT symbols describing vowels are the most difficult phoneme class to relate to the SPE concept. The main reasons are:

• The TIMIT vowels deal with regional (North American) pronunciation variants, whereas SPE is based on a more abstract, supraregional English phonological system.
• In TIMIT, diphthongs are analysed monophonematically (as single phonemes), whereas in generative phonology there are no phonological features describing acoustic events like "rising" or "falling" vowel quality.

In SPE, the English diphthongs are represented as long monophthongs, and the surface structure is derived by an appropriate diphthongization rule (SPE p. 183). As the TIMIT label set describes a representation closer to the surface, the diphthongs have to be analysed polyphonematically. The polyphonematic analysis implies that a new phoneme boundary must be placed in the centre of the diphthong, with the first half labelled with a short vowel and the second half with a glide /W/ or /Y/. The new phoneme boundary can be adjusted manually if necessary. The polyphonematic replacements should be carried out on EY, OW, AW, AY, OY (as in brain /b r ey1 n/, broke /b r ow1 k/, brown /b r aw1 n/, buy /b ay1/, and boy /b oy1/). For two only "slightly" rising vowel segments, IY and UW (as in breed /b r iy1 d/ and room /r uw1 m/), it is not obvious whether they should be analysed polyphonematically. In most (British English) descriptions, they are treated as normal long monophthongs and kept apart from the "real" diphthongs. The composition table for vowels suggested below leaves it open how to analyse /IY/ and /UW/. This leads to the following polyphonematic replacements.
EY   -> EH Y
OW   -> OH W
AW   -> AH W
AY   -> AH Y
OY   -> AO Y
(IY  -> IH W)
(UW  -> UH W)
Table 3: Polyphonematic replacements

In general, the suggested polyphonematic replacements imply a reduction of the TIMIT vowel inventory: /EY/, /OW/, /AW/, /AY/, /OY/ and, “optionally”, /IY/ and /UW/ are removed and replaced by a short vowel /EH/, /OH/, /AH/, /AO/, /IH/, /UH/ + a glide /W/ or /Y/. However, one of the short vowels, /OH/, is not in the “native” TIMIT set of vowels.

3.2.1. Tense

In general, if /IY/ and /UW/ are interpreted polyphonematically as suggested above, the feature tense has no discriminating (distinctive, phonological) function:

• /ER/ can be both plus and minus tense, e.g. backwards /b ae1 k w er d z/, birds /b er1 d z/.
• /AO/ is mostly plus tense in the TIMIT transcriptions: sauce /s ao1 s/, salt /s ao1 l t/. However, in horror /hh ao1 r axr/ and in combinations with /r/, e.g. horse, the symbol represents a minus tense pronunciation.
• If /OY/, as suggested in Section 3.2, is replaced by /AO Y/, /AO/ cannot be said to have any characteristic tense value.
• /AA/ is (mostly) not plus tense. British long /A/ is TIMIT AE, and British short /O/ is either /AO/ (e.g. "accomplish") or /AH/ (mother). However, in "morrow" /m aa1 r ow2/, /AA/ is plus tense (in normal American pronunciation).
• /AE/ is mostly minus tense (however: rather /r ae1 dh axr/, gathered /g ae1 dh axr d/).

This leads to the conclusion that tense should not be used in the composition table for the vowels. In ASR based on traditional HMMs modelling phonemes or units derived from phonemes (diphones, triphones), length (quantity) is normally relegated to prosody and, like other prosodemes (stress, word accents like the Danish “stød” or “accent 1”/“accent 2” in Swedish and Norwegian etc.), “ignored” by the decoding algorithm.

3.2.2. Front

The feature front is not used in the SPE description of English segments (SPE 235).
As TIMIT distinguishes between three unstressed vowels /AX IX AXR/ (whereas SPE has only one), we need the front feature to define a place of articulation between plus back and minus back. Otherwise /AX/ and /AXR/ cannot be separated from the back vowels /AH/ and /AA/. This leads to the following “vowel triangle” defined by three degrees of openness:

                     HIGH  LOW
(IY UW) IH UH IX      +     -
ER EH OH AX AH        -     -
AO AA AE AXR          -     +
Table 4: Openness

and three places of articulation:

         (IY) IH EH AE    ER IX AX AXR    (UW) UH OH AO AA AH
FRONT         +
BACK          -                                    +
Table 5: Place of articulation

/IX/ and /AX/ can be considered an upper and a lower variant of the “neutral” vowel /ə/ (in SPE, features are defined on the basis of deviation from a “neutral position”, cf. SPE p. 300, TRENDS p. 226). Hence, there is no segment defined solely by negative features.

3.2.3. The Vowel Composition Table

The description suggested in the table below presupposes two modifications of the TIMIT labels: firstly, the polyphonematic replacements described above in Table 3; secondly, the following two allophone-phoneme replacements:

UX   -> UW
AX-H -> AX
Table 6: Vocal allophone-phoneme replacements

The labels UX and AX-H occur only in the label files and not in the lexicon. As we consider the lexical transcriptions more phonematic, UX is to be replaced by UW, and AX-H by AX (cf. TIMIT p. 29). /UW/ may in turn be replaced by /UH W/ as described in Table 3. This leads to the following vowel inventory consisting of two slightly rising diphthongs /IY UW/ (which can be interpreted polyphonematically), nine "normal" vowels (each of which can be identified with one or two SPE segments), and three unstressed vowels:

TIMIT   (IY) (UW) ER AO AA IH UH EH AE OH AH AX IX AXR
SPE     [ ]
SONOR. SYLLABIC CONSON. HIGH BACK FRONT LOW ROUND (TENSE) ANTERIOR CORONAL VOICE CONT. NASAL STRIDENT
+ + + + + + + + + + + + + + + + + + + + + + + - + + + + - + + + + + - + + + - + + + + - + + + + - + + + - + + - + + + - + + + -
Table 7: Distinctive Feature Composition of TIMIT Vowel Segments

3.3. Consonants

As opposed to the vowels, the TIMIT consonantal labels can easily be related to the English SPE segments. The analysis presupposes some modifications.

3.3.1. The Consonantal Composition Table

In the TIMIT label files (but not in the TIMIT lexicon), plosives are analysed polyphonematically (as sequences of two phonemes) into a closure phase and a release phase, e.g. "She had your dark suit in greasy wash water all year" /sh iy hv ae dcl d y er dcl d aa r kcl k s uw dx ih ng gcl g r iy s iy w aa sh epi w aa dx er q ao l y ih axr h#/. This makes the following monophonematic replacements necessary:

PCL P -> P
TCL T -> T
KCL K -> K
BCL B -> B
DCL D -> D
GCL G -> G
Table 8: Monophonematic replacements

Finally, allophones denoting flapped variants, voiced intervocalic /h/ etc. must be replaced by the corresponding phonemes:

Allophone-phoneme replacements:
DX -> D
NX -> N
Q  -> T
HV -> H
Table 9: Consonantal allophone-phoneme replacements

The final inventory of phonemes corresponds to the SPE composition table of the English segments except for the syllabic liquids and nasals:

TIMIT   B D G P T K JH CH S SH Z ZH F TH V DH
SPE     [ ]
SONOR. SYLLABIC CONSON. HIGH BACK FRONT LOW ROUND (TENSE) ANTERIOR CORONAL VOICE CONT. NASAL STRIDENT
+ - + - + + + - + - + + + + + + + - + + + - + + + - + - + - + - + + - + + + - + - + - + + - - + + + + + + + + + + + + + + + + + + + + + + + + + + + - + + + + + + + + -

TIMIT   W Y HH R L M N NG EL EM EN ENG
SPE     [ ]
SONOR. SYLLABIC CONSON. HIGH BACK FRONT LOW ROUND (TENSE) ANTERIOR CORONAL VOICE CONT. NASAL STRIDENT
+ + + + - + + + - + + + + - + + - + + - + + - + + + + + + - + + + - + + + + - - - + + + - - - - + - + + + - + + + + - + + + - + + + + - + + - + + + + - + + + - + + + + - + + -
Table 10: Distinctive Feature Composition of TIMIT Consonantal Segments

4. References

(BITAR 1995) N.N. Bitar & C.Y. Espy-Wilson: A Signal Representation of Speech based on Phonetic Features. IEEE WS 1995.
(CLEMENTS 1985) G.N. Clements: The Geometry of Phonological Features. Phonology Yearbook 2, 1985, pp. 225-252.
(SPE 1968) N. Chomsky & M. Halle: The Sound Pattern of English. Harper & Row, New York, Evanston, London 1968.
(TIMIT 1993) J.S. Garofolo et al.: DARPA TIMIT. Acoustic-Phonetic Continuous Speech Corpus. NISTIR 4930, 1993.
(TRENDS 1975) E. Fischer-Jørgensen: Trends in Phonological Theory. Akademisk Forlag, Copenhagen 1975.
(FUNDAMENTALS 1956) R. Jakobson & M. Halle: Fundamentals of Language, 1956.
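
The polyphonematic replacements of Table 3 place a new phoneme boundary at the centre of each diphthong. A minimal Python sketch of this step is given here for illustration only. It assumes that segments are given as (start sample, end sample, label) triples with lower-case labels and without stress digits; the name `split_diphthongs` is hypothetical; and the optional replacements of IY and UW are deliberately left out. The integer midpoint is only a first guess that may be adjusted manually afterwards.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int, str]  # (start_sample, end_sample, label)

# Diphthong -> (short vowel, glide), following Table 3.
# The optional rules for IY and UW are not included in this sketch.
DIPHTHONGS: Dict[str, Tuple[str, str]] = {
    "ey": ("eh", "y"),
    "ow": ("oh", "w"),
    "aw": ("ah", "w"),
    "ay": ("ah", "y"),
    "oy": ("ao", "y"),
}


def split_diphthongs(segments: List[Segment]) -> List[Segment]:
    """Replace each diphthong by vowel + glide, with the boundary at its centre."""
    out: List[Segment] = []
    for start, end, label in segments:
        if label in DIPHTHONGS:
            vowel, glide = DIPHTHONGS[label]
            mid = (start + end) // 2  # centre of the diphthong (integer samples)
            out.append((start, mid, vowel))
            out.append((mid, end, glide))
        else:
            out.append((start, end, label))
    return out


# "buy" /b ay/ with invented sample numbers.
print(split_diphthongs([(0, 100, "b"), (100, 300, "ay")]))
```

Since /OH/ is not a native TIMIT label, any downstream label inventory has to be extended by this symbol before the output of such a step is used.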