fnl_rprt - LDC Catalog

advertisement
DRAFT of 4/20/95 -- JFP
PhoneBook Final Report
--------- ----- -----John F. Pitrelli, Cynthia Fong, Hong C. Leung
April 20, 1995
OVERVIEW
PhoneBook is a phonetically-rich, isolated-word, telephone-speech
database,
created because of (1) the lack of available large-vocabulary isolatedword
data, (2) anticipated continued importance of isolated-word and
keyword-spotting technology to speech-recognition-based applications over
the
telephone, and (3) findings that continuous-speech training data is
inferior
to isolated-word training for isolated-word recognition.
The goal of PhoneBook is to serve as a large database of American English
word
utterances incorporating all phonemes in as many segmental/stress
contexts as
are likely to produce coarticulatory variations, while also spanning a
variety
of talkers and telephone transmission characteristics. We anticipate
that it
will be useful in ways analogous to TIMIT/NTIMIT.
The core section of PhoneBook consists of a total of 93,667 isolated-word
utterances, totalling 23 hours of speech. This breaks down to 7979
distinct
words, each said by an average of 11.7 talkers, with 1358 talkers each
saying
up to 75 words. All data were collected in 8-bit mu-law digital form
directly
from a T1 telephone line. Talkers were adult native speakers of American
English chosen to be demographically representative of the U.S.
Given the large set of talkers being recruited for PhoneBook database, it
made
sense to exploit the opportunity to collect additional utterances. We
have
chosen spontaneous numerical utterances, because of widespread interest
in them
and the need for very large numbers of talkers for research into
spontaneous-speech effects. We restricted to just three spontaneous
digit
sequences and one money amount, as the lists for the core of PhoneBook
have been
designed to approach the limit of reasonable duration for a caller's
session.
As a result, PhoneBook contains a total of 5081 spontaneous utterances.
SUMMARY OF WORK
The following outline summarizes the development of PhoneBook, as well as
the
remainder of this document.
1.
Design isolated-word list.
A.
B.
C.
D.
Dictionary acquisition and format normalization
Word filtering
Phonemic/stress context enumeration criteria
Word-list-generation algorithm
2.
Data-collection equipment
3.
Recording session design
4. Design talker pool with market research subcontractor.
scope
and priority of balance of variables.
5.
Recruiting timetable and data collection.
6.
Data formatting and header creation.
A.
B.
C.
7.
Demographic
Waveform format
Filename formats
i. Isolated words
ii. Spontaneous speech
Header formats
i. Isolated words
ii. Spontaneous speech
Verification and transcription.
A.
B.
C.
D.
Talker-rejection criteria
Read-isolated-word-utterance rejection criteria
Spontaneous-utterance rejection criteria
Spontaneous-utterance transcription procedure
----------------------1.
DESIGN OF ISOLATED-WORD LIST.
The design of the phonetically-rich word list is central to the
PhoneBook.
The goal was to make the word list as compact as possible while
(1) incorporating each phoneme positioned in a variety of phonetic
and
stress contexts wide enough to cover all significant coarticulatory
variants, and
(2) inclucing only words with single pronunciations which would be
familiar
to talkers from a wide variety of backgrounds.
Our procedure was
(1) filter words with any of a variety of pronunciation problems out
of a
large source dictionary, in order to obtain a list of "candidate"
words
for the database,
(2) devise and apply to the candidate word list a criterion for
enumerating
phoneme sequences of length sufficient to capture coarticulatory
effects, and
(3) generate the final word list by choosing a subset of the
candidate list
in such a way as to capture each of the enumerated phoneme
sequences at
least once.
A.
Dictionary acquisition and format normalization
We began with the largest machine-readable dictionary sources
available,
primarily the 99,000-word CMU dictionary. We eliminated entries with
multiple words or with foreign phonemes, and reconciled the remaining
entries' phonemic representations into a common 42-phoneme inventory:
i
I
e
E
@
a
c
o
^
u
U
bEAt Y
bIt
O
bAIt W
bEt
R
bAt
x
bOb
X
bOUGHt L
bOAt l
bUt
w
bOOt r
bOOk y
bIte n
bOY
m
bOUt G
bIRd h
sofA s
buttER S
bottLE f
Let
T
Wet
z
Red
Z
Yet
v
Neat D
Meet p
siNG t
Heat k
See
b
SHe
d
Fee
g
THigh C
Zoo
J
meaSure
Van
THy
Pea
Tea
Key
Bee
Day
Geese
CHurCH
JuDGe
Vowels are marked with three levels of lexical stress -- "1" for
primary,
"2" for secondary, and no marking for reduced. These symbols, as
well as
"#" for word boundary, will be used in this document.
B.
Word filtering
The purpose of the large dictionary was to be a source from which to
draw
words for a reading task in which obtaining a particular
pronunciation from
a naive speaker was of key importance. This purpose presumably
diverges
from the goals of dictionary makers; consequently, several categories
of
problematic words had to be filtered out:
(1) Foreign words.
This distinction is fuzzy, due to
(a) varying levels of incompatibility between the word's foreign
pronunciation and the pronunciation rules of English,
(b) varying degrees of acceptance into English, and
(c) varying degrees of Anglicization of the pronunciation during
the
assimilation process.
The more assimilated, English-compatible and Anglicized words, such
as
"naive", were retained, while other words, such as "gendarme", were
filtered out.
(2) Difficult and obscure words. Some words would be pronounced
inconsistently because they are unfamiliar and puzzling, such as
"amanuensis" or "witenagemote". It is unlikely that we could
include
such a word for the purpose of obtaining any one particular
pronunciation, and hope to get even a substantial fraction of
talkers
from the general public to provide that pronunciation.
(3) Words with multiple acceptable pronunciations, whether the
pronunciations are associated with different meanings, such as
"produce", or not, such as "tomato". In either case, the word must
be
filtered out, as we cannot use such a word to elicit any one
pronunciation reliably.
(4) Potentially embarrassing words, such as those referring to
private
parts of the body. In addition to making talkers uncomfortable,
they
are unacceptable because they would likely disrupt some talkers'
attention to the following words on their lists.
(5) Words which are likely to be mis-read. Some words are much rarer
than
other similar-looking but differently-pronounced words. If we had
permitted "carousal", a rare word stressed on the second syllable,
to
be on the list, many readers would likely have said "carousel", a
common word stressed on the third syllable and therefore pronounced
very differently.
(6) Less common members of homonym pairs, such as "pidgin" and
"Lett".
These provide no advantage over the more familiar homonym.
(7) Acronyms. We have chosen not to consider these to be words.
Proper names were retained, subject to the above criteria, as they
effectively are English words, and are a good source of phoneme
sequences. After eliminating problematic words, we obtained a
filtered
dictionary of 37,549 words to serve as a candidate word list.
C.
Phonemic/stress context enumeration criteria
Our phoneme sequence enumeration is based both on common practice in
the
speech recognition community and on acoustic-phonetic and
articulatory
knowledge. Most speech recognition research approximates phonemic
contexts
using triphones as equivalence classes, implicitly assuming that the
primary coarticulatory effects on a phoneme are related to only the
one
preceding and one following phoneme. Some coarticulatory effects,
however,
reach beyond adjacent phonemes or are otherwise not covered by
traditional
triphone inventories. For example, the /u/ in "strewn" can cause some
degree of anticipatory rounding throughout the /str/ sequence.
Furthermore,
lexical stress influences the acoustics of both vowels and
consonants, but
is typically not accounted for completely and consistently by
triphone
enumeration. Finally, the position within the syllable of a phoneme
can
affect the phoneme's articulation and acoustics.
For these reasons we enumerate phonemic contexts in terms of a simple
syllable template, defining three "syllable parts" -- the "onset",
consisting of all consonants preceding the vowel; the "nucleus",
consisting
of the vowel, a mark indicating one of three stress levels, and one
postvocalic liquid if any; and the "coda", consisting of any
remaining
postvocalic consonants. However, to avoid reliance on the often-illdefined
syllable boundary within a sequence of consonants, we treat a codaonset
sequence as a single "part", "consonant sequence", with no loss of
specificity in each enumerated context.
Our inventory of phonemes-in-context contains all distinct sequences
of two
such syllable parts found in a dictionary. We also include a
triphone
inventory, due to the widespread interest in triphones in the
recognition
community. Stress marks were not considered to be part of a phone
for
purposes of determining the triphone inventory, though some stress
distinctions are implicitly represented in the vowel set, such as /I/
vs.
/|/, while others are not, such as stressed /i/ vs. unstressed /i/.
Beginning- and end-of-word were considered to be syllable parts
and phonemes for purposes of developing the two-syllable-partsequence and
triphone inventories, respectively. Henceforth we refer to
two-syllable-part sequences and triphones in aggregate as "contexts".
Working with the candidate word list, context enumeration yielded
10,839
triphones and 9458 two-syllable-part sequences.
D.
Word-list-generation algorithm
We then employed a greedy algorithm to extract a final word list
as compact as possible from the candidate word list, while still
covering our entire inventory of contexts. Our algorithm, which is
similar to one described by Kassel (ICSLP 1994, pp. 1827-1830), is:
(1) Identify each context that appears in only one word; accept those
words
into the final list and mark those contexts as covered.
(2) Assign to each uncovered context a score which is the reciprocal
of the
number of candidate words which contain it.
(3) Assign to each word a score which is the sum of the scores of its
contexts.
(4) Accept the highest-scoring word.
(5) Set the scores of that word's contexts to zero; update scores of
remaining candidate words.
(6) Go to (4), unless all remaining candidate words score zero.
Word-list generation reduced the original 37,549-word candidate list
down
to the PhoneBook's 7979-word list, while still containing every one
of the
contexts at least once. Because this algorithm favors compactness
measured
in terms of number of words, it tends to produce longer words than
average
text would contain.
Our final word list averages 2.62 syllables per
word
with a maximum of 8 ("counterrevolutionary" and "totalitarianism").
The software which generates the word list from the candidate list is
included in this distribution.
2.
DATA-COLLECTION EQUIPMENT
A PC-based recording platform was developed in our lab for this
and other similar data collections. The platform terminates a T1
trunk, enabling collection on up to 24 channels simultaneously. All
data were digitally recorded by the platform in the T1 line's 8-bit
mu-law format, with no analog conversion.
Four channels of a T1 span were used for PhoneBook. Talkers were
given a toll-free number connected to a hunt group for these
channels.
Talkers used their own telephone handsets to call our system. The
system was available for calling 21 hours a day (6 AM to 3 AM Eastern
Time), 7 days a week.
3.
RECORDING SESSION DESIGN
The 7979 words were divided into 106 disjoint lists, with 29
containing 76
words each and the remaining 77 containing 75 each. These lists were
padded
to 79 read words by adding three or four isolated read digits (not
included with PhoneBook) onto the beginning of each list, so that
talkers
could adjust to the flow of the session before reaching the words of
interest. Callers were given three seconds for each of these first
79
items.
The 80th item was a 9-digit number printed on the list; this number
identified
which of the 106 lists was being read, and it was formatted like a
Social
Security number (NNN-NN-NNNN). Callers were given seven seconds for
this item.
The remaining prompts were for spontaneous speech. The first seven
were for
information about the caller (utterances not included with
PhoneBook), the
next three elicited digit sequences, and the final one requested a
money
amount. The digit utterances are a ZIP code, intended to be five
digits
but could be nine, for which talkers were given four seconds; a
telephone
number, intended to be seven digits but could be eight, 10 or 11, for
which
five seconds were allowed; and a telephone number including an area
code,
intended to be 10 or 11 digits, for which eight seconds were allowed.
Eight
seconds were also provided for the money amount. All told, the call
lasted
approximately eight minutes. Following are the exact initial
greeting,
the 91 prompts, and the sign-off.
Thank you for calling the NYNEX Speech Database System.
For each item on your list you will hear the item number followed
by a beep. After the beep, please wait briefly and then say the
item.
Please say item 1.
Please say item 2.
Item 3.
Item 4.
Item 5.
Item 6.
7.
8.
(etc., through 80.)
What is your age?
Are you male or female?
What states and foreign countries did you live in before age 21?
What is your first language?
What languages were spoken at home while you were growing up?
If you have any speech or hearing problems, please mention them
now.
Please say your panel ID number.
Please say a ZIP code.
Please say a phone number.
Please say a phone number that includes an area code.
Please say an amount of money.
This concludes your session.
Thank you for participating.
Good-
bye.
Talkers attempting to call off hours heard the message "Thank you for
calling the NYNEX Speech Database System. Our system is operational
21 hours a day but not between 3 AM and 6 AM Eastern Time. Please
call
back during our operational hours.
4.
Thank you.
Good-bye."
DESIGN OF TALKER POOL
We contracted with The NPD Group, Inc., a marketing research firm
which
maintains a diverse panel of households in the 50 United States, to
provide
us a demographically-balanced sample of adult American talkers.
Our specification was that they send out lists in such a way that the
demographic variations in response rate would generate for us a set
of at
least 1060 talkers (ten per list), which is, in order of priority:
-- balanced according to gender (50-50), and
-- demographically representative of the United States according
to:
*
*
*
*
*
*
geography (state)
degree of urbanness of area
age, excluding below 18
income
level of education
socio-economic status
Unfortunately, native language is not a part of NPD's database, and
so
filtering out of non-native speakers of American English was done
only at
the verification stage.
5.
RECRUITING TIMETABLE AND DATA COLLECTION
Four mailings totaling 7683 letters were mailed out as follows:
4/ 8/94
4/29/94
7/ 1/94
8/12/94
9/ 2/94
1272
2650
2035
990
736.
While we enjoyed a 44% response rate, higher than anticipated, we
suffered
a 59% talker rejection rate, far in excess of what others have found
in the
past, owing to the difficulty of the material and the greater
scrutiny to
be applied to the utterances -- verifying a particular pronunciation
for
each word rather than merely any accepted one. Thus, we needed this
many
letters to reach 1358 callers, and 93,667 isolated-word utterances.
We
went beyond our initial goal in an attempt to obtain at least 10
utterances
of as many words as possible.
6.
DATA FORMATTING AND HEADER CREATION
For read isolated words, any silence exceeding 300 ms on either side
of
the segment the transcribers marked as speech was eliminated.
Spontaneousspeech files were not trimmed.
Filenames for the isolated words are of the format:
u<xxxx>_<gender>_<orthography>.wav
where <xxxx> is a four-digit unique identification number for the
talker,
<gender> is "m" or "f", and <orthography> is the orthography of the
word, in
all lower-case, with apostrophes replaced by "+"s.
Filenames for the spontaneous utterances are of the format:
u<xxxx>_<gender>_<type>.wav
where <xxxx> and <gender> are as above, and <type> is "zip_code",
"phone_num", "long_dist" or "money".
Each utterance file has a 1024-byte SPHERE header formatted as
follows
(comments within [] are not part of header):
Isolated words:
database_id PHONEBOOK
sample_rate 8000
list_number 602
[all even numbers from 602 to 812]
orthography staying
phon_trans st1exG
[See section 1C. The following entry indicates a
word-boundary/consonant-sequence sequence /st/, a
consonant-sequence/syllable-nucleus sequence /ste/ with the /e/ having
primary
stress, a nucleus/nucleus sequence primary-stressed /e/ followed by
schwa, a
nucleus/consonant-sequence sequence schwa followed by
/G/, a consonant-sequence/word-boundary sequence /G/, a "triphone"
consisting
of word-initial /st/, triphones /ste/, /tex/, /exG/, and a "triphone"
consisting of word-final /xG/. This line can reach ~200 characters for a
long word.]
phon_reason BC-st,CN-st1e,NN-1ex,NC-xG,CB-G,T-#st,T-ste,T-tex,TexG,T-xG#
sample_count 11779
sample_n_bytes 1
[1 byte per sample]
channel_count 1
[recorded on channel 1 of recording
platform]
sample_coding ulaw
[mu-law representation of waveform]
sample_byte_format 1
sample_sig_bits 8
[8 significant bits in a sample]
sample_checksum 17834
database_version 1.0
begin_time 0.300000 [speech begins 0.3 sec. into the provided
waveform]
end_time 1.172000
[speech ends 1.172 sec. into the provided
waveform]
Spontaneous speech:
database_id PHONEBOOK
database_version 1.0
sample_count 64000
sample_n_bytes 1
channel_count 1
sample_byte_format 1
sample_rate 8000
sample_coding ulaw
sample_checksum 34600
orthography (sigh)_an_amount_of_money_(sigh)_(lgh)_(ext)
begin_time 0.548750
end_time 1.924750
7.
VERIFICATION AND TRANSCRIPTION
While PhoneBook shares many utterance verification criteria with
other
speech data collection efforts, one criterion which for this project
stands
out in importance and, consequently, detail, is verification of
correct
pronunciation. Central to PhoneBook fulfilling its purpose as a
phonetically-rich training database is our ability to capture
legitimate
American English PHONETIC and PHONOLOGICAL VARIATIONS that different
speakers produce as realizations of the SAME PHONEMIC target. Thus,
we
are looking to capture regional/dialect variabilities such as how
often
talkers reduce a schwa-nasal sequence to a syllabic nasal, how often
and to
what degree talkers reduce an intervocalic alveolar stop into a flap,
what
"vowel color" talkers produce for syllable nuclei such as /c/ and
/ar/,
etc., while avoiding conflating these variations with phonemic
variations
such as whether the second syllable in "tomato" has an /e/ or an /a/.
Therefore, as described in section 1B, every effort was made to
exclude
words with multiple commonly-accepted phonemic sequences.
Transcribers were
educated as to the difference between phonemic and phonetic
variation, and
were provided with our anticipated phonemic transcription of each
word at
verification time. They were told to reject a word on the grounds of
mispronunciation only in the event that what they heard did not
represent
a reasonable phonetic realization of the phoneme sequence they saw.
By
this criterion, when an unanticipated alternate phonemic form was
discovered
at verification time, a reasonably-pronounced utterance was rejected
as
"mispronunced" because it did not represent the phoneme sequence for
which
its word was included into PhoneBook.
A.
Talker-rejection criteria
Transcribers rejected a talker upon detecting any of the following:
1.
2.
3.
4.
Foreign accent
Speech impediment
Persistent substantial background noise
A large number of problematic utterances by the same talker
(approximately 10).
In these cases, the talker and all of his/her utterances have been
eliminated from PhoneBook.
B.
Read-isolated-word-utterance rejection criteria
Transcribers rejected a core PhoneBook utterance upon detecting any
of:
1.
Truncation at beginning or end, indicating talker spoke too
soon
or spoke beyond the time limit for that response
2.
correct
3.
4.
C.
Extraneous words or syllables, regardless of proximity to
word
Substantial background noise
"Mispronunciation" as defined above.
Spontaneous-utterance rejection criteria
Transcribers rejected a spontaneous utterance only when the recording
did not contain speech.
D.
Spontaneous-utterance transcription procedure
Transcribers were told to type exactly what was said, all in lower
case,
spelling out numbers without hyphens, and using no punctuation except
for
apostrophes when the talker said a contraction or a possessive.
When a portion of a word was unclear, it was transcribed within
square
brackets if determinable ("twen[ty]") or labeled as "[?]" if not.
When background noise(s) overlapped speech, "(ext-b)" was marked
before the
first affected word and "(ext-e)" was marked after the last affected
word
for each contiguous passage of noise-affected speech regardless of
how many
noise sources were involved or the degree to which they overlapped
each
other.
Markers for other speech and non-speech events are as follows:
(uni)
unintelligible speech (other than fragments of
partially-recognizable words)
(?word)
speech was not definitely intelligible but was
probably "word"
(for)
foreign speech
(uh)
oral hesitation filler
(mm)
nasal hesitation filler
(um)
oral-then-nasal hesitation filler
(obs)
obscenity
(lgh)
laughing
(thr)
throat clearing, coughing or sniffle
(sigh)
sighing or loud breathing
(esp)
speech produced by the talker but not directed to telephone
(hang-up) hang-up clicks
Download