version 1.0 July 98
Tom Brøndsted
tb@cpk.auc.dk
Center for PersonKommunikation, Aalborg University, 1998
A SPE based distinctive feature composition of the CMU Label
Set in the TIMIT database.
IR 98-1001
Content:
0. List of Tables
1. Introduction
2. Levels of descriptions: phonetics, phonemics, underlying form, surface form.
3. Feature Classes
3.1. Major Class Features
3.2. Vowels
3.2.1. Tense
3.2.2. Front
3.2.3. The Vowel Composition Table
3.3. Consonants
3.3.1. The Consonantal Composition Table
4. References
0. List of Tables
Table 1: English Feature Classes in SPE and in the present report
Table 2: Major Class Features
Table 3: Polyphonematic replacements
Table 4: Openness
Table 5: Place of articulation
Table 6: Vocal allophone-phoneme replacements
Table 7: Distinctive Feature Composition of TIMIT Vowel Segments
Table 8: Monophonematic replacements
Table 9: Consonantal allophone-phoneme replacements
Table 10: Distinctive Feature Composition of TIMIT Consonantal Segments
1. Introduction
The background of this report is a number of speech research activities within Center for
PersonKommunikation, Aalborg University, which attempt to start from phonological features
rather than from phonemes or segments derived from phonemes (diphones, triphones, generalised
triphones) as in traditional speech technology. The report provides a phonological feature analysis
of the CMU label set in the TIMIT database (see TIMIT 1993) based on the standard generative
theory by Chomsky & Halle, the so-called “SPE”-theory (see SPE 1968). For a number of reasons,
such an analysis is not merely a matter of identifying each TIMIT-label with a corresponding
segment used in the feature composition of English by Chomsky & Halle. These reasons are
described in Section 2.
This report consists of two main sections, the first of which (Section 2) discusses some theoretical
and practical issues, and the second (Section 3) contains the “ready-to-use” composition tables and
replacement rules. Of course, the present report cannot aim at giving a general introduction to
(generative) phonology. Readers unfamiliar with expressions like “surface structure”, “deep
structure”, “underlying representation”, “morphophonemic”, “derivation rule” etc. are referred to E.
Fischer-Jørgensen (TRENDS 1975) or similar books. Alternatively, such readers may simply proceed
directly to the results presented in Section 3, Table 7 and Table 10. Note that if the composition tables
suggested here are used for some kind of automatic processing of the TIMIT database (acoustic
analysis, training of acoustic models or neural networks, acoustic recognition etc.), the TIMIT label
files (the transcription files with the extension *.phn and/or the lexicon file timitdic.txt) must be
modified as described in the “replacement rules” in Section 3, Table 3, Table 6, Table 8, and Table
9.
The distinctive feature composition presented in this report is, of course, not the only possible
solution to the problem of analysing the TIMIT/CMU label set into phonological features. The starting
point of our composition can be described as follows:
• We wanted to use SPE as a generally accepted standard theory of phonology and with as few
modifications as possible.
• Most notably, we have tried to utilise the Chomsky & Halle decomposition of English segments
(SPE 1968, p. 176 f) as directly as possible.
• Finally, we have attempted to make as few changes to the TIMIT/CMU label set as possible.
Hence, our starting point can be paraphrased as an attempt to “merge” TIMIT with SPE.
Of course, we could have attempted to merge TIMIT with phonological theories other than SPE.
The older theory of distinctive features introduced by R. Jakobson (FUNDAMENTALS 1956) may
be interesting as well, as its feature classes are defined mostly auditorily (as opposed to the rather
articulatory definitions in SPE). Provided that we expect feature classes to have some kind of direct
counterparts on the acoustic level, an auditory approach may be more interesting than an
articulatory one, since in this case it seems more important to concentrate on how speech units sound
than on how they are produced. On the other hand, the auditory approach of R. Jakobson is far more
abstract (closer to “form”, farther from “substance”) than SPE. For instance, R. Jakobson sets up a
total of twelve universal inherent feature classes, whereas SPE uses twenty-two. This means that
many feature classes are used for separating equivocal pairs of sound classes (e.g. “compact/diffuse”
separates open vowels from narrow ones as well as back consonants from front ones), and
Jakobson’s phonetic descriptions of the separate feature classes are not always transparent (cf.
TRENDS 1975 p. 156).
Newer phonological theories can also be taken into consideration. However, as these theories rarely
contribute directly to the theory of distinctive features, most of them are irrelevant to the task
defined in the present report. Some “non-linear” phonological theories advocate a hierarchical
description of feature bundles based on examination of phonological phenomena (mainly
alternations but also articulatory constraints, cf. CLEMENTS 1985). These theories are opposed to
SPE, where features are organised in one-dimensional sets. However, the idea that an inventory of
expression segments can be described in terms of a hierarchical tree structure, where upper nodes
represent major class features (like +/- vocalic, +/- consonantal), lower nodes represent cavity
features, manner of articulation etc., and terminal nodes are phonemes, is not specific to non-linear
phonology. We consider this a misunderstanding by Bitar & Espy-Wilson (BITAR 1995 p. 311).
Note that the decomposition tables in Table 7 and Table 10 can be represented “mechanically” as a
hierarchical tree when nodes are built top-down using the feature classes in the order given in the
left column.
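The “mechanical” construction of such a tree can be illustrated with a minimal Python sketch. It is not part of the original analysis: the function name build_tree is hypothetical, and the toy table uses only the three major class features with one representative label per class, with values assumed from the major class description in Section 3.1.

```python
# Feature classes in top-down order (assumption: the three major class
# features of Section 3.1; a full table would list all feature rows).
FEATURE_ORDER = ["SONORANT", "SYLLABIC", "CONSONANTAL"]

# Toy table: one representative label per major class (assumed values).
SEGMENTS = {
    "AA": {"SONORANT": "+", "SYLLABIC": "+", "CONSONANTAL": "-"},  # vowel
    "W":  {"SONORANT": "+", "SYLLABIC": "-", "CONSONANTAL": "-"},  # glide
    "EL": {"SONORANT": "+", "SYLLABIC": "+", "CONSONANTAL": "+"},  # syllabic liquid
    "L":  {"SONORANT": "+", "SYLLABIC": "-", "CONSONANTAL": "+"},  # non-syllabic liquid
    "S":  {"SONORANT": "-", "SYLLABIC": "-", "CONSONANTAL": "+"},  # obstruent
}


def build_tree(segments, feature_order):
    """Split a set of segments top-down, one feature class per tree level."""
    # Terminal node: nothing left to split on, or only one segment remains.
    if not feature_order or len(segments) <= 1:
        return sorted(segments)
    feature, rest = feature_order[0], feature_order[1:]
    branches = {}
    for value in ("+", "-"):
        subset = {s: f for s, f in segments.items() if f[feature] == value}
        if subset:
            branches[value + feature] = build_tree(subset, rest)
    return branches


tree = build_tree(SEGMENTS, FEATURE_ORDER)
print(tree)
```

In this sketch a branch is closed as soon as a single segment remains, so the obstruent /S/ becomes a terminal node directly below the -SONORANT node.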
2. Levels of descriptions: phonetics, phonemics, underlying form, surface form.
The TIMIT database consists mainly of a number of speech waveform files (*.wav), each of which is
associated with an orthographic transcription (*.txt), a time-aligned word transcription (*.wrd), and
a time-aligned phonetic transcription (*.phn). All words in the time-aligned word transcription files
are listed in the lexicon (timitdic.txt), and each entry in the lexicon consists of an orthographic
representation, in case of phonetic ambiguity a syntactic classification (e.g. “live”, which depending
on word class is pronounced either [laɪv] or [lɪv]), and a phonetic/phonemic transcription. Both the
phonetic *.phn transcription files and the transcriptions of the lexicon are based on the set of CMU
labels, each of which is a simple ASCII representation of an IPA symbol (possibly augmented
with diacritic markers) like CMU /NG/ = IPA [ŋ] or CMU /ENG/ = IPA [ŋ̩] etc. However, there is no
1:1 relation between the transcriptions of the lexicon and the *.phn transcription files. The
transcription files represent a narrow phonetic level where label decisions have been “based on
careful listening of portions of the speech waveform, as well as visual examination using displays
such as the spectrogram and the original waveform” (TIMIT p. 40). The lexicon transcriptions
represent a kind of phonemic level where labels are taken from a smaller set of symbols (denoting
“phonemes”; we use the term “phonemics” in the Post-Bloomfieldian sense). In concrete terms,
there are the following differences between lexical transcriptions and speech transcriptions:
• The symbols of the lexical transcriptions are taken from a smaller set of in total 48 symbols (46
“phonemes” and two “prosodemes”), whereas the speech transcriptions also include two non-speech
symbols (pause, epenthetic silence), six symbols denoting the closure phase of plosives,
and six allophones DX, NX, Q, HV, AX-H, UX denoting flapped variants of /D/ and /N/, a
glottal variant of /T/, a voiced variant of /H/, a devoiced schwa-vowel, and a fronted /U/-variant,
respectively.
• In the speech transcription files, words are sometimes transcribed with varying phonemic
references taking individual or regional vowel variability into account, whereas the lexicon
transcriptions must be considered more “neutral” and independent of individual and regional
deviations (within the USA). For instance, the lexicon transcribes words like “for” and “more” as /f
ao1 r/, /m ao1 r/ (IPA [fɔr], [mɔr]), whereas the speech transcriptions also have /f ow1 r/, /m
ow1 r/ (IPA [foʊr], [moʊr], with the same vowel quality as in e.g. “low”).
As the present report deals with a phonemic/phonological level rather than a phonetic one, we
consider it our main task to set up an SPE-based decomposition table for the 46 phonemes used in the
lexicon transcriptions. The speech transcriptions can be transformed into phonemic representations
using the very simple replacement rules described in Section 3.
From a traditional phonemic point of view, there are only two less convenient or “surprising” details
in the TIMIT inventory of phonemes used in the lexicon. The inventory consists of 28 consonants
and 18 vowels, and considering the variations in the phonemic descriptions of English found in the
literature (notably variations in the descriptions of affricates, diphthongs, vocalised /r/, and
diphthongs resulting from vocalised /r/, especially in British English), we only find it remarkable
that
• syllabic nasals and liquids are considered independent phonemes (normally they are interpreted
phonemically as combinations of schwa + nasal/liquid, e.g. /ə/ + /m/, /ə/ + /n/ etc.).
• the inventory includes three schwa-vowels only occurring in unstressed syllables (normally the
inventory only includes one such vowel).
We consider these details a consequence of the fact that the phonemics of TIMIT represent a
phonological direction very close to “substance” rather than a highly abstract and “formal” theory
(we refer to the classical “form-substance” discussion in phonology). However, we do not think that
TIMIT deals with pure phonetics. In terms of phonological “schools”, we may relate TIMIT to the
post-Bloomfieldian tradition rather than to the classical European approaches (Prague,
Copenhagen).
From a generative phonological point of view (which in fact is the starting point of SPE), there is
another aberrant aspect of the TIMIT inventory of phonemes. SPE only describes segments used in
underlying representations, which are subject to phonological derivation rules. The TIMIT inventory
of phonemes and the phonemic transcriptions in the lexicon rather correspond to a surface level,
where all phonological rules have been applied. Therefore, it is not always possible to relate TIMIT
symbols to the segments set up for English in SPE. Furthermore, the general SPE features have to be
modified to describe the more varied set of TIMIT phonemes.
3. Feature Classes
SPE sets up a total of twenty-two feature classes, which according to the standard theory are
sufficient for analysing expression segments (“phonemes”) of any language into distinctive
oppositions. For a distinctive feature composition of the segments of a specific language, not all
twenty-two feature classes are utilised. For instance, the SPE description of English segments (SPE
1968 p. 176 f.) makes reference to only thirteen feature classes. The remaining nine classes may be
regarded as redundant or “irrelevant” to English.
In general, we have tried to preserve the original thirteen distinctive features used in SPE for the
composition of English segments. However, for certain reasons to be discussed below, we have
made some changes. The changes are described in the table below.
SPE            HERE
VOCALIC        SONORANT
               SYLLABIC
CONSONANTAL    CONSONANTAL
HIGH           HIGH
BACK           BACK
LOW            LOW
               FRONT
ANTERIOR       ANTERIOR
CORONAL        CORONAL
ROUND          ROUND
VOICE          VOICE
TENSE          (TENSE)
CONTINUANT     CONTINUANT
NASAL          NASAL
STRIDENT       STRIDENT
Table 1: English Feature Classes in SPE and in the present report
In short, we have replaced the feature vocalic with sonorant and syllabic, added a vocalic feature
front and removed tense.
3.1. Major Class Features
We prefer the feature syllabic instead of vocalic (for discussion, cf. TRENDS p. 226 ff.). The main
reason is the use of syllabic nasals and laterals in the TIMIT label set /EM EN ENG EL/. This leads
to five major classes of segments:
             VOWELS   GLIDES     SYLLABIC LIQUIDS   NON-SYLLABIC LIQUIDS   OBSTRUENTS
                      /W Y HH/   AND NASALS         AND NASALS
                                 /EL EM EN ENG/     /R L M N NG/
SONORANT     +        +          +                  +                      -
SYLLABIC     +        -          +                  -                      -
CONSONANTAL  -        -          +                  +                      +
Table 2: Major Class Features
The main categories differ from those of the TIMIT description (p. 30 f.; note the printing error, the
missing label "nasal"), where liquids are grouped with glides instead of with nasals, and where
stops, affricates, and fricatives are considered separate main groups.
3.2. Vowels
The TIMIT symbols describing vowels form the most difficult phoneme class to relate to the SPE
concept. The main reasons are the following. The TIMIT vowels deal with regional (North American)
pronunciation variants, whereas SPE is based on a more abstract, supraregional English phonological
system. In TIMIT, diphthongs are analysed monophonematically (as single phonemes), whereas in
generative phonology there are no phonological features describing acoustic events like "rising" or
"falling" vowel quality. In SPE, the English diphthongs are represented as long monophthongs and
the surface structure is derived by an appropriate diphthongization rule (SPE p. 183).
As the TIMIT label set, of course, describes a representation closer to the surface, the diphthongs
have to be analysed polyphonematically. The polyphonematic analysis implies that a new phoneme
boundary must be placed in the centre of the diphthong, with the first half labelled as a short vowel
and the second half as a glide /W/ or /Y/. The new phoneme boundary can, if necessary, be adjusted
manually. The polyphonematic replacements should be carried out on EY, OW, AW, AY, OY (as
in: brain /b r ey1 n/, broke /b r ow1 k/, brown /b r aw1 n/, buy /b ay1/, and boy /b oy1/). For two
only "slightly" rising vowel segments, IY and UW (as in breed /b r iy1 d/ and room /r uw1 m/), it is
not obvious whether they should be analysed polyphonematically. In most (British English)
descriptions, they are treated as normal long monophthongs and kept apart from the "real"
diphthongs. The composition table for vowels suggested below leaves it open how to analyse /IY/
and /UW/. This leads to the following polyphonematic replacements.
EY  -> EH Y
OW  -> OH W
AW  -> AH W
AY  -> AH Y
OY  -> AO Y
(IY -> IH W)
(UW -> UH W)
Table 3: Polyphonematic replacements
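As an illustration, the polyphonematic replacements can be applied to the segments of a label file roughly as follows. This is only a sketch under stated assumptions: each *.phn line is assumed to hold a start sample, an end sample and a lower-case label, the function name split_diphthongs is hypothetical, and the optional rules for IY and UW are left out.

```python
# Polyphonematic replacements (the five unconditional rules; the optional
# IY and UW rules are deliberately not included in this sketch).
POLYPHONEMATIC = {
    "ey": ("eh", "y"),
    "ow": ("oh", "w"),
    "aw": ("ah", "w"),
    "ay": ("ah", "y"),
    "oy": ("ao", "y"),
}


def split_diphthongs(segments):
    """Replace each diphthong by short vowel + glide.

    segments: list of (start_sample, end_sample, label) tuples.
    """
    out = []
    for start, end, label in segments:
        if label in POLYPHONEMATIC:
            vowel, glide = POLYPHONEMATIC[label]
            mid = (start + end) // 2  # new boundary in the centre of the diphthong
            out.append((start, mid, vowel))
            out.append((mid, end, glide))
        else:
            out.append((start, end, label))
    return out


# "buy" /b ay/ with made-up sample positions
example = [(0, 1000, "b"), (1000, 3000, "ay")]
print(split_diphthongs(example))
```

The boundary placed at the midpoint is only a starting value; as noted above, it may have to be adjusted manually.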
In general, the suggested polyphonematic replacements imply a reduction of the TIMIT vowel
inventory: /EY/, /OW/, /AW/, /AY/, /OY/ and “optionally” /IY/ and /UW/ are removed and replaced
by a short vowel /EH/, /OH/, /AH/, /AO/, /IH/, /UH/ + a glide /W/ or /Y/. However, one of the short
vowels, /OH/, is not in the “native” TIMIT set of vowels.
3.2.1. Tense
In general, if /IY/ and /UW/ are interpreted polyphonematically as suggested above, the feature tense
has no discriminating (distinctive, phonological) function:
• /ER/ can be both plus and minus tense, e.g. backwards /b ae1 k w er d z/, birds /b er1 d z/.
• /AO/ is in the TIMIT transcriptions mostly plus tense: sauce /s ao1 s/, salt /s ao1 l t/. However,
in horror /hh ao1 r axr/, and in combinations with /r/, e.g. horse, the symbol represents a minus
tense pronunciation.
• If /OY/, as suggested in section 3.2, is replaced by /AO Y/, /AO/ cannot be said to have any
characteristic tense value.
• /AA/ is (mostly) not plus tense. British long /A/ is TIMIT AE, and British short /O/ is either /AO/
(e.g. "accomplish") or /AH/ (mother). However, in "morrow" /m aa1 r ow2/, /AA/ is plus tense (in
normal American pronunciation).
• /AE/ is mostly minus tense (however: rather /r ae1 dh axr/, gathered /g ae1 dh axr d/).
This leads to the conclusion that tense should not be used in the composition table for the vowels. In
ASR based on traditional HMMs modelling phonemes or units derived from phonemes (diphones,
triphones), length (quantity) is normally relegated to prosody and, like other prosodemes (stress,
word accents like the Danish “stød” or “accent 1” and “accent 2” in Swedish and Norwegian etc.),
“ignored” by the decoding algorithm.
3.2.2. Front
The feature front is not used in the SPE description of English segments (SPE p. 235). As TIMIT
distinguishes between three unstressed vowels /AX IX AXR/ (whereas SPE only has one), we need
the front feature to define a place of articulation between plus back and minus back. Otherwise /AX/
and /AXR/ cannot be separated from the back vowels /AH/ and /AA/. This leads to the following
“vowel triangle” defined by three degrees of openness:
        (IY UW) IH UH IX   ER EH OH AX AH   AO AA AE AXR
HIGH    +                  -                -
LOW     -                  -                +
Table 4: Openness
and three places of articulation:
        (IY) IH EH AE   ER IX AX AXR   (UW) UH OH AO AA AH
FRONT   +               -              -
BACK    -               -              +
Table 5: Place of articulation
/IX/ and /AX/ can be considered an upper and lower variant of the “neutral” vowel /(/ (in SPE,
features are defined on the basis of deviation from a “neutral position”, cf. SPE p. 300, TRENDS p.
226). Hence, there is no segment defined solely by negative features.
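The two classifications can be cross-tabulated to see which vowels share a cell of the “vowel triangle” and therefore need further features (such as round) to be kept apart. The following Python sketch uses the vowel groups listed for Table 4 and Table 5; the group names (high, mid, low, front, central, back) and the function name vowel_grid are labels chosen here for illustration only.

```python
# Vowel groups by degree of openness (group names are illustrative labels).
OPENNESS = {
    "high": ["IY", "UW", "IH", "UH", "IX"],
    "mid": ["ER", "EH", "OH", "AX", "AH"],
    "low": ["AO", "AA", "AE", "AXR"],
}

# Vowel groups by place of articulation (group names are illustrative labels).
PLACE = {
    "front": ["IY", "IH", "EH", "AE"],
    "central": ["ER", "IX", "AX", "AXR"],
    "back": ["UW", "UH", "OH", "AO", "AA", "AH"],
}


def vowel_grid(openness, place):
    """Return {(openness, place): sorted vowel labels} for every cell."""
    grid = {}
    for o_name, o_vowels in openness.items():
        for p_name, p_vowels in place.items():
            grid[(o_name, p_name)] = sorted(set(o_vowels) & set(p_vowels))
    return grid


grid = vowel_grid(OPENNESS, PLACE)
for cell, vowels in grid.items():
    print(cell, vowels)
```

Cells holding more than one label, for instance /AO AA/ in the low back cell, are the ones that the remaining vowel features have to resolve.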
3.2.3. The Vowel Composition Table
The description suggested in the table below presupposes two modifications of the TIMIT labels:
firstly, the polyphonematic replacements described above in Table 3; secondly, the following two
allophone-phoneme replacements:
UX   -> UW
AX-H -> AX
Table 6: Vocal allophone-phoneme replacements
The labels UX and AX-H only occur in the label files, and not in the lexicon. As we consider the
lexical transcriptions more phonematic, UX is to be replaced by UW, and AX-H by AX (cf. TIMIT
p. 29). /UW/ may again be replaced by /UH W/ as described in Table 3.
This leads to the following vowel inventory consisting of two slightly rising diphthongs /IY UW/
(which can be interpreted polyphonematically), nine "normal" vowels (each of which can be
identified with one or two SPE segments), and three unstressed vowels:
TIMIT segments (columns): (IY) (UW) ER AO AA IH UH EH AE OH AH AX IX AXR
Feature classes (rows): SONOR., SYLLABIC, CONSON., HIGH, BACK, FRONT, LOW, ROUND, (TENSE), ANTERIOR, CORONAL, VOICE, CONT., NASAL, STRIDENT
Cell values (column alignment not preserved):
+ + + + + + + + + + + + + + + + + + + + + + +
- + + + + - + + + + + - + + + - + + + + - + + + + - + + + - + + - + + + - + + + -
Table 7: Distinctive Feature Composition of TIMIT Vowel Segments
3.3. Consonants
As opposed to the vowels, the TIMIT consonantal labels can easily be related to the English SPE segments. The analysis presupposes some modifications.
3.3.1. The Consonantal Composition Table
In the TIMIT label files (but not in the TIMIT lexicon), plosives are analysed polyphonematically (as
sequences of two phonemes) into a closure phase and a release phase, e.g. "She had your dark suit in
greasy wash water all year" /sh iy hv ae dcl d y er dcl d aa r kcl k s uw dx ih ng gcl g r iy s iy w aa sh
epi w aa dx er q ao l y ih axr h#/. This makes the following monophonematic replacements
necessary:
PCL P -> P
TCL T -> T
KCL K -> K
BCL B -> B
DCL D -> D
GCL G -> G
Table 8: Monophonematic replacements
Finally, allophones denoting flapped variants, voiced intervocalic /h/ etc. must be replaced by the
corresponding phonemes:
DX -> D
NX -> N
Q  -> T
HV -> H
Table 9: Consonantal allophone-phoneme replacements
The final inventory of phonemes corresponds to the SPE composition table of the English segments
except for the syllabic liquids and nasals:
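The monophonematic and allophone-phoneme replacements for consonants can be combined in a single pass over the segments of a label file, as in the following minimal Python sketch. Several points are assumptions made for illustration: segments are (start sample, end sample, lower-case label) tuples, the function name to_phonemic is hypothetical, the target "H" of the HV rule is written with the TIMIT label hh, and a closure that is not followed by its own release is left unchanged because no rule above covers it. Non-speech labels such as h# and epi are not treated.

```python
# Closure label -> release label it merges with (monophonematic replacements).
CLOSURES = {"pcl": "p", "tcl": "t", "kcl": "k", "bcl": "b", "dcl": "d", "gcl": "g"}

# Consonantal allophone -> phoneme (assumption: "H" corresponds to label hh).
ALLOPHONES = {"dx": "d", "nx": "n", "q": "t", "hv": "hh"}


def to_phonemic(segments):
    """Merge closure + release pairs and map allophones to phonemes.

    segments: list of (start_sample, end_sample, label) tuples.
    """
    out = []
    i = 0
    while i < len(segments):
        start, end, label = segments[i]
        release = CLOSURES.get(label)
        has_release = (
            release is not None
            and i + 1 < len(segments)
            and segments[i + 1][2] == release
        )
        if has_release:
            # The merged plosive spans the closure and the release phase.
            end = segments[i + 1][1]
            label = release
            i += 1  # skip the release segment
        else:
            label = ALLOPHONES.get(label, label)
        out.append((start, end, label))
        i += 1
    return out


# "had y..." /hv ae dcl d y/ with made-up sample positions
example = [(0, 100, "hv"), (100, 200, "ae"), (200, 260, "dcl"), (260, 300, "d"), (300, 350, "y")]
print(to_phonemic(example))
```

Keeping the closure start and the release end preserves the total duration of the plosive, so the time alignment of the neighbouring segments is not affected.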
TIMIT segments (columns), part 1: B D G P T K JH CH S SH Z ZH F TH V DH
Feature classes (rows): SONOR., SYLLABIC, CONSON., HIGH, BACK, FRONT, LOW, ROUND, (TENSE), ANTERIOR, CORONAL, VOICE, CONT., NASAL, STRIDENT
Cell values (column alignment not preserved):
+ - + - + + + - + - + + + + + + + - + + + - + + + - + - + - + - + + - + + + - + - + - + + - -
+ + + + + + + + + + + + + + + + + + + + + + + + + + + - + + + + + + + + -
TIMIT segments (columns), part 2: W Y HH R L M N NG EL EM EN ENG
Feature classes (rows): SONOR., SYLLABIC, CONSON., HIGH, BACK, FRONT, LOW, ROUND, (TENSE), ANTERIOR, CORONAL, VOICE, CONT., NASAL, STRIDENT
Cell values (column alignment not preserved):
+ + + + - + + + - + + + + - + + - + + - + + - + + + + + + - + + + - + + + + - - -
+ + + - - - - + - + + + - + + + + - + + + - + + + + - + + - + + + + - + + + - + + + + - + + -
Table 10: Distinctive Feature Composition of TIMIT Consonantal Segments
4. References
(BITAR 1995) N.N. Bitar & C.Y. Espy-Wilson: A Signal Representation of Speech based on
Phonetic Features. IEEE WS 1995.
(CLEMENTS 1985) G.N. Clements: The Geometry of Phonological Features. Phonology Yearbook
2, 1985, pp. 225-252.
(SPE 1968) N. Chomsky & M. Halle: The Sound Pattern of English. Harper & Row, New York,
Evanston, London 1968.
(TIMIT 1993) J.S. Garofolo et al.: DARPA TIMIT. Acoustic-Phonetic Continuous Speech Corpus.
NISTIR 4930, 1993.
(TRENDS 1975) E. Fischer-Jørgensen: Trends in Phonological Theory. Akademisk Forlag,
Copenhagen 1975.
(FUNDAMENTALS 1956) R. Jakobson & M. Halle: Fundamentals of Language, 1956.