DISCOVERING PHONEMIC BASE FORMS AUTOMATICALLY -
AN INFORMATION THEORETIC APPROACH

by

John M. Lucassen

Submitted to the Department of Electrical Engineering and
Computer Science in Partial Fulfillment of the Requirements
for the Degrees of

BACHELOR OF SCIENCE
and
MASTER OF SCIENCE

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 16, 1983

© 1983 John M. Lucassen
The author hereby grants to MIT permission to reproduce and to distribute copies of this thesis document in whole or in part.
Signature of Author
    Department of Electrical Engineering
    and Computer Science, January 15, 1983

Certified by
    Professor Jonathan Allen,
    Academic Thesis Supervisor

Certified by
    Dr. Lalit R. Bahl,
    Company Thesis Supervisor

Accepted by
    Professor Arthur C. Smith, Chairman,
    Department Committee on Graduate Studies
ABSTRACT

Discovering Phonemic Base Forms Automatically - an Information Theoretic Approach

by

John M. Lucassen

Submitted to the Department of Electrical Engineering and Computer Science on January 15, 1983 in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science and Master of Science at the Massachusetts Institute of Technology
Information Theory is applied to the formulation of a set of probabilistic spelling-to-sound rules. An algorithm is implemented to apply these rules. The rule discovery process is automatic and self-organized, and uses a minimum of explicit domain-specific knowledge.

The above system is interfaced to an existing continuous-speech recognizer, which has been modified to perform phone recognition. The resulting system can be used to determine the phonemic base form of a word from its spelling and a sample utterance, automatically (without expert intervention). One possible application is the generation of base forms for use in a speech recognition system.

All the algorithms employed seek minimum entropy or maximum likelihood solutions. Sub-optimal algorithms are used where no practical optimal algorithms could be found.
Thesis Supervisor: Jonathan Allen
Title: Professor of Electrical Engineering

Index Terms: Spelling to Sound Rules, Information Theory, Self-organized Pattern Recognition, Efficient Decision Trees, Maximum Likelihood Estimation, Phone Recognition.
CONTENTS

1.0  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   1
  1.1  Background  . . . . . . . . . . . . . . . . . . . . . . . .   1
  1.2  Objective . . . . . . . . . . . . . . . . . . . . . . . . .   2
  1.3  The Channel Model . . . . . . . . . . . . . . . . . . . . .   5
  1.4  Feature Set Selection . . . . . . . . . . . . . . . . . . .   8
  1.5  Data Collection . . . . . . . . . . . . . . . . . . . . . .   9
  1.6  Utilizing the Sample Utterance  . . . . . . . . . . . . . .  10

2.0  The Channel Model . . . . . . . . . . . . . . . . . . . . . .  11
  2.1  Specification of the Channel Model  . . . . . . . . . . . .  11
  2.2  Imposing Structure on the Model . . . . . . . . . . . . . .  13
  2.3  Related Work  . . . . . . . . . . . . . . . . . . . . . . .  19
  2.4  Summary . . . . . . . . . . . . . . . . . . . . . . . . . .  19

3.0  Feature Set Selection . . . . . . . . . . . . . . . . . . . .  21
  3.1  Introduction  . . . . . . . . . . . . . . . . . . . . . . .  21
  3.2  Objectives  . . . . . . . . . . . . . . . . . . . . . . . .  22
  3.3  Related Work  . . . . . . . . . . . . . . . . . . . . . . .  23
  3.4  Binary questions  . . . . . . . . . . . . . . . . . . . . .  24
  3.5  Towards an Optimal Feature Set  . . . . . . . . . . . . . .  27
  3.6  Using Nearby Letters and Phones Only  . . . . . . . . . . .  29
  3.7  Towards a Set of 'Good' Questions . . . . . . . . . . . . .  33
  3.8  Details of Feature Set Selection  . . . . . . . . . . . . .  36

4.0  Decision Tree Design  . . . . . . . . . . . . . . . . . . . .  43
  4.1  Related Work  . . . . . . . . . . . . . . . . . . . . . . .  44
  4.2  Constructing a Binary Decision Tree . . . . . . . . . . . .  46
  4.3  Selecting Decision Functions  . . . . . . . . . . . . . . .  47
  4.4  Defining the Tree Fringe  . . . . . . . . . . . . . . . . .  50
  4.5  An Efficient Implementation of the BQF Algorithm  . . . . .  53
  4.6  Analysis of Decision Trees  . . . . . . . . . . . . . . . .  59
  4.7  Determining Leaf Distributions  . . . . . . . . . . . . . .  62

5.0  Using the Channel Model . . . . . . . . . . . . . . . . . . .  68
  5.1  A Simplified Interface  . . . . . . . . . . . . . . . . . .  68
    5.1.1  Paths . . . . . . . . . . . . . . . . . . . . . . . . .  69
    5.1.2  Subpaths  . . . . . . . . . . . . . . . . . . . . . . .  70
    5.1.3  Summary of Path Extension . . . . . . . . . . . . . . .  74
  5.2  A Simple Base Form Predictor  . . . . . . . . . . . . . . .  76
  5.3  The Base Form decoder . . . . . . . . . . . . . . . . . . .  77

6.0  Evaluation  . . . . . . . . . . . . . . . . . . . . . . . . .  80
  6.1  Performance . . . . . . . . . . . . . . . . . . . . . . . .  80
    6.1.1  Performance of the Channel Model by itself . . . . . . .  80
    6.1.2  Performance of the Phone Recognizer by itself . . . . . .  82
    6.1.3  Performance of the complete system  . . . . . . . . . .  83
  6.2  Objectives Achieved . . . . . . . . . . . . . . . . . . . .  85
  6.3  Limitations of the system . . . . . . . . . . . . . . . . .  87
  6.4  Suggestions for further research  . . . . . . . . . . . . .  90

A.0  References  . . . . . . . . . . . . . . . . . . . . . . . . .  92

B.0  Dictionary Pre-processing . . . . . . . . . . . . . . . . . .  94
  B.1  Types of Pre-processing Needed  . . . . . . . . . . . . . .  94
  B.2  The trivial Spelling-to-baseform Channel Model  . . . . . .  95
  B.3  Base form Match Score Determination . . . . . . . . . . . .  96
  B.4  Base form Completion  . . . . . . . . . . . . . . . . . . .  97
  B.5  Base form Alignment . . . . . . . . . . . . . . . . . . . .  97

C.0  The Clustering Algorithm  . . . . . . . . . . . . . . . . . .  98
  C.1  Objective . . . . . . . . . . . . . . . . . . . . . . . . .  98
  C.2  Implementation  . . . . . . . . . . . . . . . . . . . . . .  99

D.0  List of Features Used . . . . . . . . . . . . . . . . . . . . 101
  D.1  Questions about the current letter  . . . . . . . . . . . . 101
  D.2  Questions about the letter at offset -1 . . . . . . . . . . 102
  D.3  Questions about the letter at offset +1 . . . . . . . . . . 104
  D.4  Questions about phones  . . . . . . . . . . . . . . . . . . 105

E.0  Overview of the Decision Tree Design Program  . . . . . . . . 108

F.0  Evaluation Results  . . . . . . . . . . . . . . . . . . . . . 109
LIST OF ILLUSTRATIONS

Figure  1.  The Spelling to Baseform Channel . . . . . . . . . . . .   6
Figure  2.  The Sequential Correspondence between Letters and Phones   7
Figure  3.  Another View of the Channel  . . . . . . . . . . . . . .  14
Figure  4.  A More Complicated Alignment . . . . . . . . . . . . . .  15
Figure  5.  Different Alignments . . . . . . . . . . . . . . . . . .  18
Figure  6.  Information Contributed by Letters and Phones  . . . . .  29
Figure  7.  Information Contributed by Letters, Cumulative . . . . .  31
Figure  8.  Information Contributed by Phones, Cumulative  . . . . .  32
Figure  9.  The Best Single Question about the Current Letter  . . .  35
Figure 10.  The Best Second Question about the Current Letter  . . .  35
Figure 11.  Number of Questions asked about Letters and Phones . . .  62
Figure 12.  Coefficient Sets and their Convergence . . . . . . . . .  67
Figure 13.  Typical Channel Model Errors . . . . . . . . . . . . . .  82
Figure 14.  Typical Phone Recognizer Errors  . . . . . . . . . . . .  84
Figure 15.  Partially Specified Pronunciations . . . . . . . . . . .  95
PREFACE
The work presented in this thesis was done while the author was with the
Speech Processing Group at the IBM Thomas J. Watson Research Center. Much
of the work done by the Speech Processing Group utilizes methods from the
domain of Information Theory [Bahl75] [Jeli75] [Jeli76] [Jeli80].
The
project presented here was approached in the same spirit.
The project was initiated in 1980, as a first attempt to make use of the
pronunciation information contained in a large on-line dictionary (70,000
entries).
The decision to use an automatic decision tree construction
method to find spelling-to-sound rules was due to Bob Mercer.
To some
extent the project was a feasibility study: at the outset, we did not know
how difficult the task would be.
A persistent effort was made not to introduce explicit domain-specific
knowledge and heuristics into the system.
Consequently, the system is
highly self-organized, and adaptable to different applications.
Several aspects of the project called for the formulation and efficient
implementation of some basic algorithms in Information Theory. The
resulting programs can be, and have been, used for other purposes.
ACKNOWLEDGEMENTS
I would like to thank the members of the Speech Recognition Research
Group, who were always willing to answer questions and to lend a helping hand.
I will not mention everyone by name - you know who you are.
In particular, I would like to thank Lalit Bahl and Bob Mercer, whose
insights and ideas were immensely helpful at all stages of the project.
Finally, I would like to thank my advisor, Professor Jonathan Allen, for
his support; and John Tucker, the director of the MIT VI-A Coop Program,
for making my work at IBM possible.
1.0  INTRODUCTION
This chapter gives an overview of the setting in which the project was conceived and the line of reasoning that determined its shape. It also gives a brief description of each of the major portions of the work. It is intended to aid the reader in developing a frame of mind suitable for reading the remaining chapters.
1.1  BACKGROUND
The (phonemic) base form of a word is a string of phones or phonemes that describes the 'normal' pronunciation of the word at some level of detail. It is called the 'base form' to distinguish it from the 'surface form' or phonetic realization of the word. Phonemic base forms are useful for several reasons:
*  From the base form of a word, one can derive most alternate pronunciations of the word by the application of phonological rules [Cohe75].

*  Base forms reflect similarities between words and between parts of words at a more fundamental level than surface forms.
Because the number of different phones is much smaller than the number of different words, these similarities between words at the phone level can be exploited in a number of ways. In particular, phone-based speech recognition systems may use them to model coarticulation phenomena, to facilitate the adaptation to a new speaker, and to reduce the amount of computation needed for recognition [Jeli75].
Phonemic base forms may be obtained in a variety of ways. Many dictionaries contain pronunciation information from which base forms can be derived. Alternatively, anyone who 'knows' the correct pronunciation of a word and has some knowledge of phonetics can determine its base form. However, it is not generally possible to determine the base form of an unknown word by listening to how it is pronounced, because information may be lost in the transformation from base form to surface form.
Base forms (and surface forms) may be expressed at various levels of detail, to suit their application. For example, some applications need to distinguish three or more different levels of stress on vowels, while some need only one or two; for some applications it is necessary to distinguish between different unstressed vowels, for other applications this distinction is not important.

Because each application sets its own standard regarding the amount of detail specified in a base form, different applications may not be able to share base forms, and it may be necessary to generate a new set of base forms for each new application.
1.2  OBJECTIVE
The objective of this project was to find some way of generating base forms for new words automatically (i.e., without expert intervention).

*  The main application we envision is enabling users of a phone-based speech recognition system to add new words (of their own choice) to the vocabulary of the system.

*  A second application is the verification and/or correction of existing base forms.

*  A third application is the custom-tailoring of base forms to individual speakers, to overcome the effects of speaker idiosyncrasies.
Based on these intended applications, the following objectives were formulated:
1.  The system should use only information that can be provided by a typical user of a speech recognition system, such as

    *  a few samples of the correct pronunciation of a word, possibly by different speakers
    *  the spelling of the word
    *  the part-of-speech of the word
    *  sample sentences in which the word occurs.

2.  The system should be adaptable to a particular convention regarding the amount of detail specified in the base forms.

In order to satisfy the first objective, we decided to use the following sources of information: the spelling of a word, and a single sample utterance.
The spelling of the word is used because (1) it is available in all the applications we presently anticipate, and (2) it contains a lot of information about the pronunciation of the word (although perhaps less so in English than in some other languages).
We decided to use a sample utterance as well, because the spelling of a word by itself does not contain enough information to determine its pronunciation. For example, the pronunciation of the word 'READ' depends on its tense as a verb; further examples include the words 'NUMBER' (noun or adjective), 'CONTRACT' (verb or noun) and 'UNIONIZED' (union-ized or un-ionized).
We will use only one sample of the pronunciation because this is the simplest and least expensive option.
Since we are using the pronunciation of a word as a source of information, one may wonder why it is necessary to use the spelling of the word as well. There are two reasons for this:

1.  we do not presently have an automatic phone recognizer that is sufficiently reliable, and

2.  as pointed out before, a sample utterance corresponds to a surface form, and it is not generally possible to infer the corresponding base form from this surface form.
There are a variety of potential sources of information which we do not use at the present time. These include the part-of-speech of the word, the capitalization of the word, and so on. Knowledge of the part-of-speech of a word can help to resolve ambiguities in the base form, especially in stress assignment. Case information (whether the first letter of a word is capitalized, for example, or whether the word is spelled entirely in upper case) can also provide clues about the pronunciation, since it appears that different spelling-to-sound rules apply to proper names, acronyms and abbreviations than to 'regular' words. While we do not presently use these types of information, we do not preclude using them in the future.

The second objective (adaptability) is satisfied by the use of self-organized methods wherever possible.
1.3  THE CHANNEL MODEL
We view the task of predicting the possible base forms of a word from its spelling as a communication problem. We will define a hypothetical transformation that turns letter strings into base forms, and construct a model of this transformation, by means of which we can determine the likelihood that a particular letter string produced a certain base form.¹

This transformation is presumed to take place as a result of transmission of a letter string through a channel that garbles the letter string and produces a phone string as output (see Figure 1 on page 6). Thus, the model of this transformation will be referred to as a channel model. Information and communication theory provide us with a variety of tools for constructing such models.
There are a variety of ways to define our channel model. In the interest of simplicity, we have based our model on the assumption that each letter in the spelling of a word corresponds to some identifiable portion of the base form, and that these correspondences are in sequential order, as illustrated in Figure 2 on page 7 for the words 'BASE' and 'FORM'.²

¹ It may be that in reality, no such transformation takes place. For example, it may be that letter strings are determined by base forms, or that both are determined by something else, perhaps a morpheme string.

² I will use upper case and quotes to denote letters ('A', 'B'), and lower case and strokes to denote phonemes (/ei/, /b ee/). Phone strings are presented in a notation similar to that defined in [Cohe75] (p. 309). Since all phone strings in this thesis are accompanied by the corresponding letter strings, I do not provide a translation table here.
Figure 1.  The Spelling to Baseform Channel: The input to the channel, a string of n letters, is shown on the left. The output of the channel, a string of m phones, is shown on the right. (Example: the letter string 'T A B L E' maps to the phone string /t ei b uh l/.)
Because of this sequential correspondence, the channel model can operate as follows: for each letter that enters the channel, a sequence of phones, representing the pronunciation of that letter, emerges at the other end. The pronunciation of each letter will generally consist of no more than one or two phones, and it is empty in the case of silent letters.

Since the pronunciation of a letter often depends on the context in which it occurs, it is imperative that the model be allowed to inspect this context before it predicts the pronunciation of a letter. This implies that the model must have memory. The power of the model can be increased further by allowing it to remember the pronunciations of the letters that it has already predicted. Thus, the following information is taken into account by the channel model as it predicts the pronunciation of a given letter in a word: the letter itself, any letters to its left, any letters to its right, and the pronunciation of the letters to the left. Hereafter, I will refer to this information as the 'context' that produces the pronunciation; I will refer to the letter whose pronunciation is being predicted as the 'current letter', and to its pronunciation as the pronunciation of the current letter or the 'pronunciation of the context'.
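To make the notion of a context concrete, here is a minimal sketch in Python; it is purely illustrative, and the record layout and names are assumptions rather than the thesis implementation:

    # Illustrative only: one possible representation of a 'context'.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Context:
        left_letters: List[str]    # letters to the left of the current letter
        current_letter: str        # the letter whose pronunciation is predicted
        right_letters: List[str]   # letters to the right (look-ahead)
        left_phones: List[str]     # pronunciations already predicted (look-back)

    # Example: predicting the 'A' of 'TABLE' after 'T' has been pronounced /t/.
    ctx = Context(left_letters=['T'], current_letter='A',
                  right_letters=['B', 'L', 'E'], left_phones=['t'])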
Figure 2.  The Sequential Correspondence between Letters and Phones: each letter of 'BASE' and 'FORM' is aligned with the portion of the base form that it produces (B -> /b/, A -> /ei/, S -> /s/, E -> //, F -> /f/, O -> /aw/, and so on).
The channel model needs to predict the pronunciation³ of individual letters of a letter string. Because the language is not finite, and because we have a limited amount of sample data, we will construct a model that is able to generalize: a model that can predict the pronunciations of contexts that have never been seen before. Such a model can operate by predicting the pronunciation of a context on the basis of familiar patterns within that context. The specification of such a pattern together with a probability distribution over the pronunciation alphabet is a spelling to pronunciation rule. In "Decision Tree Design" on page 43 ff., I will describe an algorithm that can find these spelling to pronunciation rules automatically, based on a statistical analysis of a sufficient amount of 'training data'. The rules will be expressed in the form of a decision tree.

³ Read this as "determine the probability distribution over the set of possible pronunciations".
1.4  FEATURE SET SELECTION
The decision tree design program is subject to a number of limitations; most important, it requires that the set of features out of which rules may be constructed be specified in advance, and that this set be limited in size. The choice of this feature set is of profound importance, since only features included in this set can be used to formulate rules. The feature set should satisfy the following, conflicting, requirements:

1.  it should be small enough to allow the tree design program to operate within our resource limits;

2.  it should be sufficiently general, so that any spelling to pronunciation rule can be expressed effectively in terms of the features in the feature set.

We can compromise by using a relatively small number of features that are 'primitive' in the sense that any other feature can be expressed in terms of them. More specifically, we will represent each letter and phone of a context in terms of a small number of binary features. These features are answers to questions of the form 'is the next letter a vowel' and 'is the previous phone a dental'. Features are selected on the basis of the Mutual Information between the feature values and the pronunciation of the current letter. The feature selection process is described in detail in "Feature Set Selection" on page 21 ff.⁴

⁴ [Gall68] gives a comprehensive overview of Information Theory.
1.5  DATA COLLECTION
The tasks of feature selection and decision tree design are accomplished by means of self-organized methods. Consequently, we need to have access to a sufficient amount of sample data for training. Because the channel model is defined in terms of letters and their pronunciation, the training data must also be in this form. This means that the training data must consist of a collection of aligned spelling and base form pairs, in which the correspondence between letters and phones is already indicated.

Training data of this sort is not generally available: while a typical dictionary gives the pronunciation of most words, these pronunciations are not broken down into units that correspond to the individual letters of the word.⁵ Consequently, we have had to align all the spelling and base form pairs used for training. Since the number of such pairs is rather large (70,000), this could not be done by hand; instead, we used a simple, self-organized model of letter to base form correspondences (designed and built by R. Mercer) to perform this alignment. The alignment process is described in "Base form Alignment" on page 97 ff..

⁵ This is not necessary since it appears that people can determine the correct alignment, when needed, without difficulty.
Two sources of spelling and base form information were used: an on-line dictionary, and the base form vocabulary of an existing speech recognition system. The data from the dictionary had to be processed before it could be used, because it contained incomplete pronunciations, in entries of the following form:

    tabular   /t ae - b j uh - l uh r/
    tabulate  /- l ei t/
and because it did not contain the pronunciations of regular inflected forms such as plurals, past tenses and participles, comparatives and superlatives, and so forth. Furthermore, there were some errors in the dictionary, which had to be eliminated. See "Types of Pre-processing Needed" on page 94 ff..
1.6  UTILIZING THE SAMPLE UTTERANCE
The Channel Model has been interfaced to a phone recognizer, hereafter referred to as the 'Linguistic Decoder'. A maximum-likelihood decoder is used to identify the base form that is most likely to be correct given the spelling of the current word and the sample utterance.

The phone recognizer is simply a suitably modified version of the connected speech recognition system of [Jeli75], and is not central to this thesis. The incorporation of the Channel Model into the maximum-likelihood decoder is described in "Using the Channel Model" on page 68 ff..
2.0  THE CHANNEL MODEL

2.1  SPECIFICATION OF THE CHANNEL MODEL
This section describes the general properties of the Channel Model and its implementation. The Channel Model was defined with two objectives in mind:

1.  the model should be simple, so that its structure can be determined automatically;

2.  the model should be as accurate as possible.
The channel model operates on a word from left to right, predicting phones
in left-to-right order.
This decision was made after preliminary testing
failed to indicate any advantage to either direction.
The left-to-right
direction was chosen to simplify the interface to the linguistic decoder,
as well as because of its intuitive appeal.
The Channel that we seek to model is nondeterministic: a given letter string (such as 'READ') may be pronounced differently at different times. Thus, the Channel Model is a probabilistic model: given a letter string, it assigns a probability to each possible pronunciation of that string.
The mapping from letter strings to phone strings is context sensitive: the pronunciation of a letter depends on the context in which it appears. Therefore, the channel model has to be able to 'look back' to previous letters and to 'look ahead' to future letters. The fact that the channel model is allowed to 'look back' implies that the model has memory. The fact that it is allowed to 'look ahead' further implies that it operates with a delay. The look-back and look-ahead is limited to the letters of the word under consideration. Consequently, the base form of each word is predicted without regard to the context in which the word appears.
Since the likelihood of a particular pronunciation for a letter depends on the pronunciation of preceding letters, we also allow the channel model to 'look back' at the pronunciation of previous letters. This type of look-back is necessary for the model to predict consistent strings. Let me illustrate this with an example: consider the word 'CONTRACT', and its two pronunciations⁶

         C  O  N  T  R  A  C  T
    (1)  /k 'aa n t r ae k t/    (noun)
    (2)  /k uh n t r 'ae k t/    (verb)

Since the channel operates from left to right, the correct pronunciation of the 'A' depends on the pronunciation assigned to the 'O'. If the model is unable to recall the pronunciation decisions made earlier on, it will also predict the following pronunciations, each of which is internally inconsistent:

         C  O  N  T  R  A  C  T
    (3)  /k 'aa n t r 'ae k t/   (two stressed syllables)
    (4)  /k uh n t r ae k t/     (no stress at all)

By allowing the model to 'look back' to pronunciation decisions made earlier on, we can avoid such inconsistencies. This capability also allows the channel model to use knowledge about the permissibility of particular phone sequences in its predictions - thus, the model can avoid producing phone sequences that cannot occur in normal English.

⁶ The symbol /'/ (single quote) denotes primary stress; the symbol /./ (period) will be used to indicate secondary stress.
Finally, we will model the channel as being causal. This implies that the channel model can not use any information about the pronunciation of future letters in determining the pronunciation of the current letter.
Without this restriction, we would be unable to compute the likelihood of
a complete pronunciation string from the individual probabilities of the
constituent pronunciations.
2.2  IMPOSING STRUCTURE ON THE MODEL
To simplify the channel model, we view the channel as if it predicts the
base form for a letter string by predicting the pronunciation of each letter separately (see Figure 3 on page 14).
The pronunciations of the indi-
vidual letters of the word are then strung together to form the base form
of the word.
Motivation
This simplification can be made because, in general, each letter in the spelling of a word corresponds to some identifiable portion of the base form, and these correspondences are in sequential order. (Exceptions to this rule will be discussed shortly.) This simplification allows us to decompose our original task into two independent and presumably simpler sub-tasks:

1.  to identify the likely pronunciations of each letter of the word (and compute their probabilities), and

2.  to combine the results and identify the likely base forms of the word (and compute their probabilities).
Figure 3.  Another View of the Channel: The input to the channel, a string of n letters, is shown on the left. The output of the channel, a string of m phones, has been organized into n groups, one group of phones for each letter. Some of these groups may be empty; others may contain more than one phone. (Example: 'E X T R A' maps to /eh/, /k s/, /t/, /r/, /uh/.)
The word is decomposed into letters because this is the simplest possible
decomposition:
any decomposition based upon higher-level
units such as
morphemes would require a method for segmenting the letter string, which
would add to the complexity of the model.
Figure 4.  A More Complicated Alignment: The segment /uh l/ is aligned with the 'L', and the 'E' is silent. (For 'TABLE': T -> /t/, A -> /ei/, B -> /b/, L -> /uh l/, E -> //.)
Limitations
The above decomposition of the problem is based on an obvious over-simplification. Consider, for example, the word 'TABLE' and its base form /t ei b uh l/. While we have no difficulty matching up 'TAB' and /t ei b/, it is not clear how to match up 'LE' and /uh l/ while preserving the sequential nature of the alignment. Although the details of alignment are relatively unimportant, it is important that the alignments be consistent: if the sample letter- and phone strings are aligned haphazardly, it will be difficult to determine the corresponding spelling to base form rules. By using machine-generated alignments, we can be assured of some degree of consistency. The alignment program, which was designed and implemented by R. Mercer, does produce the 'correct' alignment, as illustrated in Figure 4. The alignment process is described in "Base form Alignment" on page 97 ff..
A second source of disagreement between the 'real' spelling to base form
channel and our channel model is found in the treatment of digraphs such
as 'EA', 'OU', 'TH' and 'NG' and repeated letters such as 'TT' and 'EE',
which often correspond to a single phone. Since the model is capable only of predicting the pronunciation of individual letters, it predicts the pronunciation of such a digraph as if the phone corresponds to one of the letters and the other letter is silent. The main drawback of this approach is that the model must decide on the pronunciation of each letter separately, thereby increasing the computational workload and the opportunity to make errors. On the other hand, it eliminates the need to segment the letter string.
A further limitation of the simplified model is that it can not deal straightforwardly with words in which the correspondence between letters and phones is not sequential, such as 'COMFORTABLE' (with base form /k 'uh m f t er b uh l/) and 'NUCLEAR' (with the non-standard base form /n 'uu k j uh l er/). While the channel model is in principle capable of dealing with rules of any complexity, complex rules such as those involving transpositions may be hard to discover. Fortunately, the correspondence between letters and phones is sequential in most words.
Implementation of the Channel Model
Every prediction made by the Channel Model is based upon knowledge of the
context of the current letter:
*  the current letter itself,

*  the letters that precede and follow it (up to the nearest word boundary), and

*  the pronunciation of preceding letters.
The problem of predicting the pronunciation associated with a particular context is a pattern recognition problem. This is a well-studied problem domain [Case81] [Meis72] [Stof74]. We will use a decision tree as classifier. Aside from the advantages normally associated with decision trees [Meis72] [Payn77], we will take advantage of the fact that a decision tree can be constructed automatically. We have implemented an algorithm to perform this task. It is described in detail in "Decision Tree Design" on page 43 ff..
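To illustrate how a decision tree can act as the classifier, here is a minimal sketch, again purely illustrative: each internal node asks one binary question about the context, and each leaf stores a probability distribution over the pronunciation alphabet. The node layout and question functions are assumptions, not the program described in "Decision Tree Design".

    # Illustrative sketch of a decision-tree classifier over contexts.
    from dataclasses import dataclass
    from typing import Callable, Dict, Optional

    @dataclass
    class Node:
        question: Optional[Callable[[object], bool]] = None  # None at a leaf
        yes: Optional['Node'] = None                          # child if the answer is true
        no: Optional['Node'] = None                           # child if the answer is false
        distribution: Optional[Dict[str, float]] = None       # pronunciation -> probability

    def classify(node: Node, ctx) -> Dict[str, float]:
        # Walk the tree by answering binary questions until a leaf is reached;
        # ctx is a context record such as the one sketched in section 1.3.
        while node.question is not None:
            node = node.yes if node.question(ctx) else node.no
        return node.distribution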
Combining Pronunciations into Base forms
Because the channel is causal,
the probability of a particular pronuncia-
tion string is equal to the product of the probabilities that each of the
letters is pronounced correctly:
    p(X | L) = Π_{i=1..n} p(x_i | L, x_1, x_2, ..., x_{i-1})

where X is a pronunciation string, L is a letter string, and x_i is the i'th element of X.
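In code, the causal product above could be evaluated as in the following sketch; it is illustrative only, and p_letter stands for the per-letter conditional distribution supplied by the channel model (an assumed interface, not the actual implementation):

    import math

    def log_prob_pronunciation(letters, phones, p_letter):
        # log p(X | L) = sum over i of log p(x_i | L, x_1 ... x_{i-1}).
        # letters: the letter string L, e.g. ['T', 'A', 'B', 'L', 'E']
        # phones:  one (possibly empty) pronunciation per letter
        # p_letter(letters, i, history): dict of pronunciation -> probability
        logp = 0.0
        history = []
        for i, x in enumerate(phones):
            logp += math.log(p_letter(letters, i, history)[x])
            history.append(x)
        return logp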
However, we are not interested in the probability of a particular pronunciation string - we are interested in the probability of a particular base form. A base form may correspond to many different pronunciation strings, each of which represents a different way of segmenting the base form into pronunciations (see Figure 5 on page 18). Thus, the probability of a particular base form is the sum of the probabilities of all the pronunciation strings that are representations of this base form. Fortunately, the number of plausible alignments of a word is usually small, so that we can calculate their probabilities without difficulty. The number of alignments with non-negligible probabilities is determined to a large extent by the consistency of the alignments in the training data.
Figure 5.  Different Alignments: A single base form may correspond to several different pronunciation strings, depending on the alignment between the letters and the phones. For 'BOAT' and /b ou t/: (1) B -> /b/, O -> /ou/, A -> //, T -> /t/; (2) B -> /b/, O -> //, A -> /ou/, T -> /t/; (3) B -> /b ou/, O -> //, A -> //, T -> /t/. Alignment (1) is more plausible than alignments (2) or (3).
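Building on the sketch above, the probability of a base form could then be obtained by summing the causal product over the candidate alignments; the alignment list is assumed to come from an alignment step such as the one described in Appendix B, and the names are illustrative:

    def prob_base_form(letters, alignments, p_letter):
        # p(base form | L): sum over candidate alignments of the product of
        # per-letter pronunciation probabilities. Each alignment supplies one
        # (possibly empty) phone group per letter, e.g. for 'BOAT' and /b ou t/:
        #   [['b', 'ou', '', 't'], ['b', '', 'ou', 't'], ['b ou', '', '', 't']]
        total = 0.0
        for phones in alignments:
            total += math.exp(log_prob_pronunciation(letters, phones, p_letter))
        return total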
2.3  RELATED WORK
One difference between our design and typical existing spelling to sound systems [Alle76] [Elov76] [Hunn76] is the fact that we do not use a dictionary of exceptions. Such a dictionary can be of great practical value in predicting the pronunciation of high-frequency English words, since these words tend not to follow the rules that apply to the remainder of the English vocabulary. We draw no distinction between 'rules' and 'exceptions', for several reasons:

1.  Our system is intended mainly to predict the pronunciation of infrequently used words, namely words that are not already included in the standard vocabulary of a speech recognition system. Thus, there is no need to devote extra attention to high-frequency words.

2.  By including exceptions in the probabilistic model, the system will be capable of reproducing any exceptional pronunciations that were encountered in the training data without any special effort. Furthermore, the pattern matching power of the system is available to identify any derived forms to which an exception may apply.

3.  We are operating under the assumption that we have insufficient training data. Therefore, we may encounter exceptions that did not occur in the training data. By including known exceptions in the training data for our probabilistic model, we can obtain a better estimate of the reliability of the model.
2.4  SUMMARY
Now that we have formulated our Channel modelling task as a decision tree design problem, we are faced with several sub-problems, which are the subject of the following chapters:

*  We need to select a set of primitive features in terms of which the spelling to base form rules will be expressed (see "Feature Set Selection" on page 21 ff.);

*  We need to formulate the spelling to base form rules themselves - construct the decision tree and determine the probability distribution (over the output alphabet) associated with each leaf (see "Decision Tree Design" on page 43 ff.);

*  Finally, we need to construct an interface that allows communication between the Channel Model and the Linguistic Decoder (see "Using the Channel Model" on page 68 ff.).
3.0  FEATURE SET SELECTION

3.1  INTRODUCTION
In the preceding chapter, we motivated our decision to construct a decision tree with spelling to base form rules. Unfortunately, there appears to be no practical method for constructing an optimal decision tree for this problem with our limited computational resources [Hyaf76]. We will therefore use a sub-optimal tree construction algorithm, which is described in detail in the next chapter.
This tree construction algorithm builds a decision tree in which all rules are of the form 'if A and B ... and C then X', where A...C are binary features of a context, chosen from a finite set, the feature set, and 'X' specifies a probability distribution over the pronunciation alphabet.⁷ The size of this feature set is subject to conflicting constraints: if the set is too small, the resulting family of rules will not be very powerful; if it is too large, the tree construction program will consume too much of our limited computing resources, while being subject to the 'curse of dimensionality' (see [Meis72]).
The more difficult problem is to determine not how many,
but which fea-
tures to include in the feature set.
Although the size of the feature set
is determined manually and somewhat
arbitrarily, the selection of indi-
vidual features has been largely automated.
⁷ Since the presence or absence of a particular feature is ascertained by asking a corresponding question, the terms 'feature' and 'question' may often be used interchangeably.
3.2  OBJECTIVES
The feature set that is made available to the decision tree design program should possess the following properties:

1.  All the features that are necessary for the construction of a good decision tree should be present in the set.

    It is clear that the pronunciation of an unknown letter in an unknown context is determined mostly by the identity of the letter itself.⁸ Thus, a good feature set will probably contain features by which the current letter may be identified. In many cases, surrounding letters also affect the pronunciation, and this must be reflected in the feature set.

2.  There should be no features in the feature set that are informative but not general.

    For example, suppose that each of the sample words in the dictionary occupies a single line in the dictionary. Clearly, knowing the line number at which a particular context occurs can be helpful in predicting the pronunciation for that context. However, this information does not generalize to words that do not occur in the dictionary, and therefore does not contribute much to the power of the model.

3.  The feature set should not contain features that are not informative. Uninformative features should be avoided for two reasons:

    *  their presence increases the amount of computation required to design the decision tree, even if they are not used at all;

    *  where the sample data is sparse, they may appear to be informative (because of the 'curse of dimensionality'); however, if they are ever incorporated in a decision tree, this tree will be inferior.

⁸ For a confirmation of this assertion, see Figure 6 on page 29.
3.3  RELATED WORK
Several existing systems that predict pronunciation from spelling operate on the basis of a collection of spelling to pronunciation rules [Alle76] [Elov76]. To the extent that these systems operate on the letter string sequentially (from left to right), they can be re-formulated in terms of a decision tree similar to the one we use. The rules that make up these systems are generally of the form 'In the context αβγ, β is pronounced /δ/', where β is a letter (or a string of letters), α and γ are either

*  individual letters, sounds or letter-sound combinations ('Z', /r/, 'silent E'),

*  classes of letters, sounds or combinations thereof ('vowel', 'T or K', 'consonant cluster'), or

*  strings of the above ('a silent E followed by a consonant'),

and /δ/ specifies a sequence of phones (possibly null).
These rules are very similar in structure to the rules that we seek to discover. Consequently, it is likely that some of the features used by these rule-based systems would be good candidates for our feature set; the converse may also be true. Since we have made no particular attempt to duplicate features used by other systems, any similarity between the feature sets is either coincidental or (more likely) a confirmation that these features are indeed meaningful.
3.4  BINARY QUESTIONS
Any question about a member of a finite set (a letter, a phone, a pronunciation) can be expressed as a partition of that set, such that each subset corresponds to a different answer to the question. For example, the answer to the question 'is the letter L a vowel' can be represented by a partition of the letter alphabet into vowels and non-vowels.⁹

⁹ Since any question about a letter or phone is equivalent to a corresponding partition of the letter- or phone alphabet, the terms 'question' and 'partition' may often be used interchangeably.

For reasons described below, we will consider only binary questions (about binary features) as candidates for the feature set. Note that this can be done without loss of generality: just as any integer can be expressed as a sequence of binary digits, any set membership question can be expressed in terms of a set of binary questions.
To minimize fragmentation
The use of binary features makes it easier to construct a reliable decision tree from a limited amount of sample data, because it minimizes the amount of data fragmentation. The concept of data fragmentation can be illustrated as follows: consider placing an n-ary test at some node in a decision tree, given only m sample contexts for that node. Since there are n possible outcomes, the average number of samples corresponding to each outcome is m/n. The larger this number m/n, the easier it is to make statistically valid observations in the sample sets corresponding to each of the possible outcomes. Since m is given, we can maximize m/n by choosing the smallest possible value for n, namely 2.
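As a made-up numerical illustration of the m/n argument:

    m = 1000        # sample contexts available at a node
    print(m / 2)    # a binary test leaves 500 samples per outcome on average
    print(m / 5)    # a 5-ary test leaves only 200 samples per outcome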
To simplify evaluation
Throughout feature selection and decision tree design, we need to compare the merits of individual questions. In general, the mutual information¹⁰ between the answer to a question and the unknown (pronunciation) which is to be predicted is the most convenient standard of comparison. It is clear that a question with several possible outcomes can carry more information than a binary question. However, this greater information content is bought at a price: the use of such questions, as pointed out above, can cause unnecessary fragmentation. Thus, we cannot compare questions with different numbers of possible outcomes by simply measuring their information content: we need an evaluation standard that balances the gain in information against the cost of data fragmentation. We can avoid this problem by limiting ourselves to features of a single degree, such as binary features.

¹⁰ The mutual information between two variables is defined in terms of the entropy (H), or mathematical uncertainty, of each of the variables and the joint entropy of the two variables taken together:

        MI(x ; y) ≡ H(x) + H(y) - H(x, y).

    It represents the amount of information that each variable provides about the other variable:

        H(y | x) = H(y) - MI(x ; y).

    The entropy of a random variable x with n possible values x_1, ..., x_n, with respective probabilities p_1, ..., p_n, is defined as

        H(x) = - Σ_{i=1..n} p_i log2(p_i).
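The quantities defined in the footnote can be computed directly from a joint probability table; the sketch below is illustrative only and is not one of the thesis programs:

    import math
    from collections import defaultdict

    def entropy(p):
        # H = - sum of p_i * log2(p_i) over a dict of probabilities.
        return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0.0)

    def mutual_information(joint):
        # MI(x ; y) = H(x) + H(y) - H(x, y), with joint[(x, y)] = p(x, y).
        px, py = defaultdict(float), defaultdict(float)
        for (x, y), p in joint.items():
            px[x] += p
            py[y] += p
        return entropy(px) + entropy(py) - entropy(joint)

    # Example: MI between a binary feature and a two-way pronunciation choice.
    joint = {('true', 'a'): 0.4, ('true', 'b'): 0.1,
             ('false', 'a'): 0.1, ('false', 'b'): 0.4}
    print(round(mutual_information(joint), 3))   # about 0.278 bits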
To simplify enumeration
The number of different possible questions about a discrete random variable is equal to the number of different possible partitions of the set of possible values of this variable.¹¹ This number is generally very large. In fact, even the set of all possible binary questions about letters or phones, while much smaller, is too large to allow us to enumerate all such possibilities. However, it is much easier to consider a significant fraction of the possible binary partitions than it is to consider a significant fraction of all possible partitions.

¹¹ If a typical context consists of seven letters (a few on each side of the 'current letter', and the current letter itself) and three phones to the left of the current letter, the number of distinct contexts is 26^7 × 50^3 ≈ 1.0 × 10^15. The number of different ways of partitioning a set of this size into subsets is far too large to allow individual consideration of every possibility.

To deal with limited sample data

Because we have only a limited amount of sample data, the information content of a particular feature can be estimated only with a certain reliability. In fact, the more candidate features one examines, the greater the chance of finding one that 'looks' much better than it really is. By considering only binary questions, we limit the number of candidate features considered. Thus, if we find any features that 'look' good, they probably are.
To simplify implementation
Finally, the exclusive use of binary features greatly simplifies the programs that are used to generate, compare,
and select individual features.
Since the answer to a binary question can assume only two different values
('true'
or
'false'),
only one bit of memory is needed to represent each
binary feature of a context.
The features
that describe a context can
therefore be conveniently viewed as a bit vector, and the problem of predicting the pronunciation of a context can be seen as a bit-vector classification problem.
3.5  TOWARDS AN OPTIMAL FEATURE SET
The problem of selecting the best subset of a set of features ('measurement selection') has been widely studied, but no efficient and generally applicable method appears to have been found. The difficulty of the problem is illustrated with examples by [Elas67] and by [Tous71], the latter showing that the best subset of m out of n binary features (m=2 and n=3 in his example) need not include the best single feature and may in fact consist of the m worst single features. This, and similar results obtained elsewhere, suggests that in general, an optimal feature set can be found only by enumerating and evaluating a very large number of candidate feature sets. We are therefore forced to use a sub-optimal method for feature set selection.
we will
only consider features that refer to a single element of the context only
(a letter or a phone).
This allows us to subdivide the feature selection
problem into a number of independent sub-problems,
suitable features
namely the selection of
for each of the elements of a context.
Since a single
element of a context may assume only 26 to 50 different values, the feature
selection
problem for such elements
entire contexts.
is much easier than that for
(Approximately 1015 different contexts are possible).
The restriction that each question can refer only to a single letter or
phone is actually quite severe:
if no questions about multiple elements
of the context are included in the feature set, any pattern that involves
multiple elements (such as a digraph) can be identified only by identifying each of its constituent elements.
rules will be more complicated.
This means that the corresponding
The restriction is justified only by the
fact that we have no computationally attractive method of identifying such
more general features.
Note: The definition of 'context elements' calls for elements (letters or phones) at a particular distance ('offset') away from the current letter. When the current letter is near a word boundary, there may be no actual letter or phone at a given distance. To deal with this situation, the letter alphabet and the phone alphabet are each augmented by the distinguished symbol '#', and the phone- and letter strings of all contexts are presumed to be extended with an infinite sequence of '#' in both directions.
    Letter at    Mutual             Phone at    Mutual
    Offset       Information        Offset      Information
    -4           0.110              -3          0.161
    -3           0.130              -2          0.307
    -2           0.217              -1          0.785
    -1           0.636
     0           2.993
    +1           0.809
    +2           0.351 (*)
    +3           0.197
    +4           0.122

Figure 6.  Information Contributed by Letters and Phones: This table shows the Mutual Information between a letter (or phone) at a given offset away from the current letter and the pronunciation of the current letter. For example, the second letter to the right of the current letter provides an average of 0.351 bits of information (*).
3.6  USING NEARBY LETTERS AND PHONES ONLY
On the average, the entropy or mathematical uncertainty (H) of the pronunciation of a letter is 4.342 bits. This difficulty is roughly equivalent to a choice, for each letter, between 2^4.342 = 20.3 equally likely alternatives.¹²

¹² This value, 2^H, is called the perplexity. The term 'equivalent average number of equally likely alternatives' is intended as a synonym for 'perplexity'.
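As a quick arithmetic check of the perplexity figures used in this section (a worked example, not part of the thesis):

    print(2 ** 4.342)   # about 20.3 equally likely alternatives
    print(2 ** 1.349)   # about 2.5, once the current letter is known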
The pronunciation of the average letter is determined mostly by the identity of the letter itself and its immediate context, and to a much lesser extent by letters that are further away. Indeed, the mutual information between the pronunciation of a given letter and the identity of the letters surrounding it decreases rapidly with increasing distance, as is shown in Figure 6 on page 29. According to this table, revealing the identity of the current letter reduces the entropy of its pronunciation to 4.342 - 2.993 = 1.349 bits, equivalent to a choice between an average of 2.5 equally likely alternatives.¹³ Also, it can be seen that letters to the right (corresponding to a positive offset) are more informative than letters to the left. This may be due in part to the fact that where a digraph
corresponds to a single phone, that phone is aligned with the first letter
in the digraph.
A more useful measure of the information content of far-away letters is the amount of additional information that they provide if the identities of closer-by letters are already known. (I will refer to this quantity as the conditional mutual information.) Because the total amount of information contributed cannot exceed the entropy of the pronunciation itself (which is about 4.342 bits), the conditional mutual information falls off much more rapidly with increasing distance than the unconditional mutual information. This is illustrated in Figure 7 on page 31. The fact that far-away letters contain only very little information about the pronunciation of the current letter justifies our decision to limit the feature set to questions that refer to nearby letters only.

¹³ The figures cited here were obtained from our sample data, and are not (necessarily) representative of, for example, running English text. In particular, the sample data was drawn from a dictionary without regard to word frequencies, and filtered to eliminate entries containing blanks or punctuation. Furthermore, the figures cited depend on the phone alphabet used.
    Letter at    Mutual         Cumulative Mutual    Additional Mutual
    Offset       Information    Information          Information
     0           2.993          2.993
    +1           0.809          3.440                0.446
    -1           0.636          3.797                0.357
    +2           0.351          4.022                0.225
    -2           0.217          4.130                0.108
    +3           0.197          4.180                0.050
    -3           0.130          4.199                0.019
    +4           0.122          4.214                0.015
    -4           0.110          4.220                0.006

Figure 7.  Information Contributed by Letters, Cumulative: This table shows how much information (about the pronunciation of the current letter) is obtained by determining the identity of the surrounding letters in the order shown. The current letter provides 2.993 bits of information; the letter to its right yields another 0.446 bits; the letter to its left yields another 0.357 bits, and so on.
The mutual information contributed by phones at increasing distances is
subject to a similar decline:
see Figure 8 on page 32.
In this figure,
the first table indicates how much information the phones to the left of
the current letter provide if the identity of this letter is known (hence
the figure of 2.993 at the top of the 'cumulative'
column).
The second
table indicates how much information they provide if the four nearest letters on each side of the current letter are known as well (hence the figure
of
4.220 at the top of the 'cumulative'
column).
Note that these
figures are much lower, and fall off even more rapidly than those in the
Feature Set Selection
31
first
table.
These figures provide some theoretical justification for the
fact that we do not include questions about far-away phones in the feature
set.
In Figure 8, note the figure on the bottom line: the amount of information provided by letters -4..+4 and phones -3..-1 totals 4.239 bits. Since the entropy of the unknown pronunciation is 4.342 bits, we are 0.103 bits short and therefore cannot predict the pronunciation with perfect accuracy. This uncertainty is due to two factors:

1.  even if all available information is taken into account, the pronunciation of a context may still be ambiguous (in the word 'READ', for example); furthermore,

2.  all contextual information that is more than 4 letters or 3 phones away is ignored.

It is not clear how much of the difference is due to each of these factors. Note that an entropy of 0.103 is roughly equivalent to a choice between an average of 2^0.103 = 1.07 equally likely alternatives.
3.7  TOWARDS A SET OF 'GOOD' QUESTIONS
This section describes the method used for finding good binary questions about letters and phones. This method was designed to satisfy the objectives outlined before: a maximal information content of the individual questions, and a maximum of independence so that a combination of questions yields as much information as possible.
    Phone at    Mutual         Cumulative Mutual    Additional Mutual
    Offset      Information    Information          Information
                               2.993
    -1          0.785          3.398                0.405
    -2          0.307          3.630                0.233
    -3          0.161          3.798                0.167

                               4.220
    -1          0.785          4.234                0.014
    -2          0.307          4.238                0.004
    -3          0.161          4.239                0.002

Figure 8.  Information Contributed by Phones, Cumulative: This table shows how much information (about the pronunciation of the current letter) is obtained by determining the identity of successive phones to the left of the current letter. In the top table, the identity of the current letter is known; in the bottom table, the identity of the letters at offsets -4..+4 are known.
Note: Since the same method is used for both letters and phones, this section explains the method only in terms of letters and their pronunciations. The generalization to phones is given in "Questions about Phones" on page 40 ff..
The only source of information used in the feature selection process for the 'current letter' is an array that describes the joint probability distribution of letters and their pronunciations,

    p(L, x)

where L is a letter and x is its pronunciation. This information is derived from the aligned letter- and phone strings that constitute the training data for the model.
For now, we will assume the existence of a method for determining the single best question about a letter, given a particular optimality criterion. We can then proceed as follows. First, we determine the single most informative binary question Q about a letter L: that which maximizes

    MI(Q(L) ; x),

where x is the pronunciation of L (and MI stands for Mutual Information). We will call this question QL1 (question about the current Letter, no. 1). By definition, this choice of QL1 minimizes the conditional entropy, or remaining uncertainty, of the pronunciation of a letter L given Q(L).
The question QL1 obtained by this method is shown in Figure 9 on page 35. The mutual information between QL1(L) and the pronunciation of L is 0.865 bits, which is close to the theoretical maximum of 1 bit.

    Q1(L) is 'true'  for {B C D F K L M N P Q R S T V X Z},
    Q1(L) is 'false' for {# A E G H I J O U W Y '}.

    Figure 9.  The Best Single Question about the Current Letter

After QL1 is found, we determine the most informative second question Q about a letter L: that which maximizes

    MI(Q(L) ; x | QL1(L))

(Note that this is not the same as the second-most informative question). We will call this question QL2. The question QL2 obtained by this method is illustrated in Figure 10 on page 35. The amount of additional information obtained by QL2,

    MI(QL2(L) ; x | QL1(L)),

is 0.722 bits. Thus, QL1(L) and QL2(L) together provide 0.865 + 0.722 = 1.587 bits of information about the pronunciation of L, which is more than half of the total amount of information carried by L (2.993 bits, as indicated in Figure 6 on page 29).

    Q2(L) is 'true'  for {A C D E I K M O Q S T U X Y Z '},
    Q2(L) is 'false' for {# B F G H J L N P R V W}.

    Q1(L) and Q2(L) together divide the alphabet into four groups:
        1: {# G H J W}
        2: {A E I O U Y '}
        3: {B F L N P R V}
        4: {C D K M Q S T X Z}

    Figure 10.  The Best Second Question about the Current Letter

We continue to generate new questions, QLi(L), while maximizing

    MI(QLi(L) ; x | QL1(L), QL2(L), ..., QLi-1(L))

This process is continued until every letter L is completely specified by the answers to each of the questions about it:

    QL1(L), QL2(L), ..., QLn(L)  -(determine)->  L

Since the answers QL1(L) through QLn(L) completely define L, it is not possible to extract any further information from it.
As the reader will recall, this description of the question-finding algorithm assumes the existence of a method for identifying each of the questions that maximize the various objective functions given above. Unfortunately, even the most efficient methods we know for finding guaranteed optimal questions would require a disproportionate amount of computation. Therefore, we must once again resort to a sub-optimal method. The algorithm used is described in detail in "The Clustering Algorithm" on page 98 ff. By supplying different optimality criteria, different questions may be derived.
3.8
DETAILS OF FEATURE SET SELECTION
By invoking the clustering program with different parameters (in particular, different probability distribution arrays and different objective functions), we have obtained a variety of different questions. The questions are listed in "List of Features Used" on page 101 ff.
Questions about the Current Letter
Questions QL1 through QL7 were obtained by clustering letters with the objective of maximizing

MI(QLi(L) ; x | QL1(L), QL2(L), ..., QL(i-1)(L)),

where x is the pronunciation of the letter L. In other words, these questions are the best first question about the current letter, the best second question given the first, the best third question given the first two, and so on (the name QL names a Question about a Letter). In cases where several different partitions would provide the same amount of mutual information by this definition, ties were resolved by giving preference to partitions that had the highest unconditional mutual information MI(QLi(L) ; x). Thus, each question obtained is not only the best i'th question (within the limits of the algorithm), but also a good question when considered by itself.

Let me point out some of the salient features of the questions QL1..7 and the partitions of the letter alphabet that they define (as listed in "Questions about the current letter" on page 101 ff.).
*  The amount of (conditional) mutual information provided by the first few questions is near the theoretical limit of 1.0, and falls off rapidly for further questions. This confirms our conjecture that a few features contain most of the information in a context, and that these features can be identified automatically. It is inevitable that the conditional mutual information provided by later questions falls off, since the sum of the conditional mutual information is fixed at 2.993 bits (see Figure 6 on page 29).

*  The unconditional mutual information of successive questions falls off as the questions become more specialized in making distinctions not made by preceding questions, but then increases to nearly its original value as the number of clusters to be broken up increases and the clustering program utilizes the resulting freedom to maximize the secondary objective function as well.
*  In general, it can be seen that as the alphabet is partitioned into smaller and smaller subsets, each subset contains more and more similar letters. In particular, it can be seen that the distinction between 'K' and 'Q' is not drawn until after six questions. The reason for this is clear: both are pronounced /k/ with high likelihood. Indeed, the distinction between 'K' and 'Q' is worth only 0.0008 bits of information, as is indicated at QL7. Also, it turns out that 'E' and apostrophe (''') remain together for a long time. This is not surprising considering the use of the apostrophe in words such as "I'LL" and "YOU'LL".
*  One may wonder why the first question does not distinguish between vowels and consonants, since this is the most 'obvious' binary classification of letters. Indeed, it turns out that the question 'is letter L a vowel' conveys 1.0 bits of information about the identity of L, since, in our sample data, a randomly selected letter has a probability of 0.50 of being a vowel. However, a simple calculation (not given here) shows that the question would contribute only 0.790 bits of information about the pronunciation of L, which is somewhat less than the 0.865 bits contributed by QL1.
Questions about the letter at offset -1
Questions QLL1 through QLL6 (Questions about the Letter to the Left) were derived specifically to capture the information contained in the letter to the left of the current letter, given the identity of the current letter. (Because of the order in which letters were added in Figure 7 on page 31, the amount of mutual information obtainable from this letter is not listed there. It is 0.346 bits.) The questions (which are listed in "Questions about the letter at offset -1" on page 102 ff.) were obtained with the objective of maximizing the mutual information between QLLi(L) and the pronunciation of the letter following L, given the identity of that letter. In other words, after determining the identity of the current letter, QLL1(L) is the best single question about the letter immediately to its left, QLL2(L) is the best second question, and so on. In order to derive these questions, a variant of the clustering algorithm had to be developed.

This new clustering task is somewhat more complicated than the simple clustering of letters. More importantly, it also requires more space: it is necessary to store the probability distribution array

p(L, LL, X(L))

where L is the current letter, LL is the letter to its left, and X(L) its pronunciation. This array contains over 140,000 entries.
Questions about the letter at offset +1
Questions QLR1 through QLR7 (Questions about the Letter to the Right) were derived specifically to capture the information carried by the letter to the right of the current letter, given the identity of the latter. As indicated in Figure 7 on page 31, this letter contains an average of 0.446 bits of information about the pronunciation of the current letter. The questions, which are given in "Questions about the letter at offset +1" on page 104 ff., were obtained by the same method used for QLL1..6, although the joint probability distribution used was of course

p(L, LR, X(L))

where L is the current letter, LR is the letter to its right, and X(L) its pronunciation.
Questions about the remaining letters
As indicated above, the current letter and the letters immediately adjacent to it are represented in the feature set by features which were especially chosen to be optimal for this purpose. Unfortunately, our clustering program in its present form is unable to perform this task for letters at larger distances from the current letter, because the amount of storage required to represent the joint probability distribution arrays increases by a factor of 26 for every unit of increase in the distance from the current letter.(14) Therefore, we cannot derive questions specifically to capture the relevant features of far-away letters. Given the choice between several sub-optimal question selection methods, we decided simply to use QLL1..6, for the following reasons:

*  Any set of questions derived on the basis of properties of the letters is probably better than one which is not (such as a Huffman code or a binary representation of the position of the letter in the alphabet)

*  QL1..7 have some unpleasant properties, due to the forced alignment between letter and phone strings: for example, the distinction between 'K' and 'Q' is found to be nearly irrelevant. There may be other, less obvious anomalies

*  When faced with the choice between QLL1..6 and QLR1..7, I simply chose the set which requires the smaller number of bits to represent each letter, lacking any clues as to which would perform better.

(14) For a letter at a distance of (plus or minus) two from the current letter, the probability distribution array would occupy nearly 17 megabytes of storage.
Thus, the lexical context of the 'current letter' is expressed in terms of 56 bits: seven bits each for the current letter and the letter to its right, and six bits each for the remaining letters from offset -4 to +4.
Questions about Phones
The same clustering algorithm that is used to identify good questions about the current letter is also applied to the phone alphabet. This time, the objective is to identify good questions about the phone that immediately precedes the current letter. The objective function, the mutual information between the preceding phone and the pronunciation of the current letter, can be computed from the joint probability distribution

p(P, x),

where x is the pronunciation of the current letter and P is the preceding phone.

The questions obtained (QP1..8, see "Questions about phones" on page 105 ff.) have some amount of intuitive appeal, in that vowels are separated from consonants rather quickly, while the phones that stay together for a long time appear to be rather similar.
For the benefit of skeptics (including the author) who would like to see a 'rationalization' of some of the more questionable aspects of these questions, a single example was investigated in more detail, namely the apparently paradoxical placement of 'ER0' and 'ER1' in opposite classes in QP1. First, of course, the present assignment of these two phones results in the highest Mutual Information: moving either phone to the opposite subset or swapping the two phones yields an inferior result. Further inspection showed that the phones occur in quite distinct contexts: 'ER0' is followed by <empty> with a probability of 0.46 and by 'ZX' with a probability of 0.16, whereas 'ER1' is followed by 'SX', <empty>, 'TX', 'KX', 'NX', 'MX', 'DX' and 'VX' with a probability of between 0.100 and 0.054 each.
Each of the three phones immediately preceding the current letter is represented in the feature set by eight bits, namely the answers to QP1 through QP8 for each phone. Since the feature set also contains 56 bits that represent letters, the feature set consists of a total of 80 features.
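As a rough illustration of how such a feature vector might be assembled, the sketch below (Python; the question sets are placeholders standing in for the sets listed in "List of Features Used", and the function name is my own) packs the nine letters at offsets -4..+4 and the three preceding phones into the 80 bits described above.

    def encode_context(letters, phones, ql, qlr, qll, qp):
        # letters: the 9 letters at offsets -4..+4 (index 4 is the current letter)
        # phones:  the 3 phones at offsets -3..-1
        # ql, qlr: 7 question sets each; qll: 6 question sets; qp: 8 question sets
        # Each question is represented as the set of symbols for which it is 'true'.
        bits = []
        bits += [letters[4] in q for q in ql]      # current letter:      7 bits
        bits += [letters[5] in q for q in qlr]     # letter at offset +1: 7 bits
        for idx in (0, 1, 2, 3, 6, 7, 8):          # remaining letters:   6 bits each
            bits += [letters[idx] in q for q in qll]
        for phone in phones:                       # preceding phones:    8 bits each
            bits += [phone in q for q in qp]
        return bits                                # 7 + 7 + 7*6 + 3*8 = 80 bits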
Other Features
We have experimented with a variety of other features. Some of these were simply more compact representations of information already otherwise available; others provided information not obtainable from the regular feature set. The following features have been used in experiments:

*  the stress pattern on vowel phones in the phones to the left of the current letter, represented as a vector of stress levels

*  the location and length of consonant clusters in the phone sequence already predicted

*  whether or not a primary stressed syllable has already been produced.

Other features were considered, but never actually used as we went back to the more basic feature set:

*  The position of the current letter in the word

*  The first and last letters of the word, or perhaps some characterization of the final suffix of the word (suffixes are given special treatment in some existing spelling to sound systems, such as [Elov76]).
In practice, we have found that the details of the feature set do not seem to matter much. This is true only up to a point, of course: we have not tried any feature sets that are predictably bad. (Some might argue that we have not tried any feature sets that are predictably good, either). The method described in this chapter yields acceptable results with a minimum of human intervention.
4.0
DECISION TREE DESIGN
This chapter describes the algorithm used to design the decision tree that contains the spelling to base form rules. The exposition assumes an understanding of the concept of a decision tree.

The problem at hand requires that a decision tree be built to classify bit-vectors of approximately 80 bits each. This must be done automatically, based on an analysis of up to 500,000 labeled sample bit-vectors. Each vector must be classified as belonging to one of approximately 200 classes.
We will make the following assumptions about the data vectors:

*  Each bit-vector may belong to more than one class, i.e. each context may have more than one pronunciation;

*  Any output class can correspond to a large number of different underlying patterns, which may correspond to completely different bit-vectors;

*  Some of the bits in the vector contain much more information than others, and only a few of these informative bits need to be inspected to classify a particular vector;

*  The cost of testing a bit is constant (the same for all bits).

It is possible that we may have made some hidden assumptions that are not listed here.
The exposition in this chapter will cover the following subjects:

*  related work (existing decision tree design algorithms)

*  the theory behind the decision tree design algorithm

*  a set of ad-hoc mechanisms to limit the growth of the decision trees, and their rationalizations

*  an efficient implementation of the decision tree design algorithm

*  a discussion of the complexity of this implementation

*  methods by which to analyze a decision tree, to determine how 'good' it is (in terms of classification accuracy, space requirements and so on), and

*  the method used to 'smooth' the sample data distributions associated with the leaves of the decision tree.
4.1
RELATED WORK
The automatic and interactive design of decision tree classifiers has been the subject of much study [Case81] [Hart82] [Hyaf76] [Payn77] [Stof74]. Most of the reported work deals either with the design of optimal trees for very simple problems or with sub-optimal trees for problems that are more complicated or involve a lot of data. Among the papers that deal with the classification of discrete feature vectors, [Case81] and [Stof74] appear to be the most applicable to the present problem.

Casey [Case81] addresses the problem of designing a decision tree for character recognition. He wishes to classify arrays of 25x40 picture elements (pels), represented as 1000-bit vectors. His classifier design algorithm obtains much of its efficiency due to the simplifying assumption that the individual bits in a bit-vector are conditionally independent, given the character that they represent. Casey reports that for the data to which his algorithm was applied (bit-map representations of characters in a single font), the assumption of conditional independence is justified. In part, this may be the case because every character corresponds to a single pattern.
Unfortunately, the bit-vectors that we need to classify do not appear to exhibit this property, because most pronunciations can correspond to a variety of different patterns: for example, the pronunciation /z/ can come from an 'S', a 'Z' or an 'X'. We are therefore unable to use Casey's algorithm directly, although we will employ some of the same heuristics.

Stoffel [Stof74] presents a general method for the design of classifiers for vectors of discrete, unordered variables (such as bit-vectors). He presents a worked-out example in which his method is used to characterize patterns in (8x12) bit-maps that represent characters. In order to be able to deal with large amounts of sample data, he makes the simplifying assumption that most of the bits in the vector are approximately equally informative. This allows him to define the distance between vectors in terms of the number of positions in which they have different values. He then places vectors that differ in only a few bits in equivalence classes, thereby reducing the size of the problem. It appears that in the data to which he applied his method, bit-map representations of characters, it is indeed true that most bits are approximately equally informative.

Unfortunately, the bit-vectors that we need to classify do not exhibit this property: we know that some bits carry as much as 0.9 bits of information, while others carry 0.02 bits or less. This results from the fact that 'nearby' letters and phones are much more informative than those that are further away. While we cannot adopt this particular simplifying assumption, it seems that some variation of his method might be used for data reduction and to improve the modelling capability of the system. No such improvements are presently planned.
4.2
CONSTRUCTING A BINARY DECISION TREE
A binary decision tree for predicting the label (pronunciation) of a bit vector (context) may be constructed as follows:

1. Consider an infinite, rooted binary tree (growing downward from the root).

2. Designate a suitable set of nodes of this tree to be the leaves of the tree. (I will refer to the set of leaf nodes as the fringe of the tree). Discard any nodes that lie below the fringe.

3. Associate a decision function with each internal node in the remaining tree.

4. Associate with each leaf node of the tree a probability distribution over the set of possible labels.

The decision function at each node in the tree is selected on the basis of an analysis of the sample data that would be routed to that node during the classification process. It follows that the decision function of a node can be determined only if the decision functions for the nodes along the path between it and the root are known - thus, they must be assigned to the nodes of the tree in a top-down order. It also follows that each sample contributes to the selection of several decision functions, namely one for each internal node along the path from the root to the leaf to which it is eventually classified. Finally, it follows that each sample contributes to the selection of at most one decision function at each level in the tree. (A level consists of nodes that are at the same distance from the root). This property will be exploited later.
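A minimal sketch of the resulting structure (illustrative Python with assumed names, not the thesis's implementation) may help fix ideas: internal nodes carry a decision function, leaves carry a label distribution, and classifying a bit-vector visits exactly one node per level, which is the property referred to above.

    class Node:
        def __init__(self):
            self.bit = None           # index i of the bit tested at this node
            self.left = None          # subtree followed when B[i] == 0
            self.right = None         # subtree followed when B[i] == 1
            self.distribution = None  # label distribution, present only at leaves

    def classify(root, bits):
        # Route one bit-vector from the root to a leaf, one node per level.
        node = root
        while node.bit is not None:
            node = node.right if bits[node.bit] else node.left
        return node.distribution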
4.3  SELECTING DECISION FUNCTIONS

When the decision function of a given node must be selected, the decision tree design algorithm operates as follows. The sample data routed to this node (by the portion of the decision tree that lies above it) is presumed to be available. Recall that this sample data consists of labeled bit-vectors, or pairs

{L, B[1..n]},

where L is a label (a pronunciation), and B[1..n] represents a vector of n bits. From the sample data, we may compute the mutual information between L and each of the bits in the bit-vector, B[i]:

MI(L ; B[i]).

The decision function, then, is simply a test against the bit i for which this mutual information is greatest. This heuristic (which for this particular problem was suggested by R. Mercer) is also employed by [Case81], and undoubtedly by others. [Hart82] shows that this choice of decision function minimizes an upper bound on several commonly used cost functions for the resulting tree.
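The selection just described can be sketched as follows (illustrative Python with assumed names, not the production program): compute MI(L ; B[i]) = H(L) - H(L | B[i]) for every bit position, using counts taken from the labeled bit-vectors routed to the node, and keep the best bit.

    import math
    from collections import Counter

    def entropy(counts, total):
        return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

    def best_bit(samples, n_bits):
        # samples: list of (label, bits) pairs; bits is a sequence of 0/1 values.
        total = len(samples)
        h_l = entropy(Counter(label for label, _ in samples), total)
        best_i, best_mi = None, -1.0
        for i in range(n_bits):
            by_value = {0: Counter(), 1: Counter()}   # label counts for B[i] = 0 and 1
            for label, bits in samples:
                by_value[bits[i]][label] += 1
            # H(L | B[i]) as a weighted sum over the two values of B[i]
            h_cond = sum((sum(c.values()) / total) * entropy(c, sum(c.values()))
                         for c in by_value.values() if c)
            if h_l - h_cond > best_mi:
                best_i, best_mi = i, h_l - h_cond
        return best_i, best_mi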
The following example illustrates the decision function selection process: consider the situation at the root of the tree. For all questions i, the mutual information MI(B[i] ; L) is calculated. By definition, these values must lie between zero and one. In a typical situation, there are about 80 questions; of these, 25 contribute less than 0.05 bits of information each, 41 contribute between 0.05 and 0.10 bits, 7 contribute between 0.10 and 0.50 bits, and 7 contribute more than 0.50 bits. Thus, the most informative question is indeed much more informative than a question selected at random. In our example, the question 'Is the current letter one of {#, A, E, G, H, I, J, O, U, W, Y, '}' contributes the most information, namely 0.865 bits. Thus, it is selected as the decision function at the root of the decision tree.
The decision function is now used to split the sample data into two subsets, namely (1) the samples in which the current letter is one of {#, A, E, G, H, I, J, O, U, W, Y, '}, and (2) the samples in which it is not. One of these subsets of the sample data set corresponds to the left branch of the decision tree, the other to the right branch. Since we started with a label entropy of 4.342 bits, and were able to obtain 0.865 bits of information from the first question, the weighted average label entropy at the leaves of the tree is now 4.342-0.865 = 3.477 bits. In fact, the entropy in the branch corresponding to the 'vowels' is about 4.2 bits, and the entropy in the branch corresponding to the 'consonants' is about 2.8 bits.
For each subtree separately, the above procedure is repeated: the data are analyzed and the Mutual Information between B[i] and L is calculated for all values of i. In each subtree, the single best question is selected as the decision function and used to further partition the data. As it turns out, the next question about 'vowels' provides as much as 0.95 bits of information, reducing the entropy in that branch from 4.2 bits to 3.2 bits. The next question about 'consonants' is somewhat less effective, providing only 0.53 bits of information.

This process continues along all the branches of the tree, extending the tree until it contains as many as 15,000 leaves. It would be impractical to investigate every branch of the tree manually; statistical tools are used for further analysis (see "Analysis of Decision Trees" on page 59).

Because the decision tree design algorithm operates without 'look-ahead' and simply selects the best available single question, I will refer to it as the Best Question First, or BQF algorithm. The BQF heuristic has a number of desirable properties:
1. Only the single most informative question is used, hence the selection is in general unaffected by the presence of uninformative features

2. The binary trees built by this method tend to be fairly balanced, since a balanced tree offers the greatest potential for entropy reduction

3. Only sample data local to the node is used in the selection of the decision function, hence the tree can be constructed efficiently

4. The algorithm is incremental in nature; thus, the desired tree size need not be specified in advance but can be allowed to depend on observations made during tree construction. It is also possible to refine an existing tree.
Only tests of single bits in the bit-vector are considered as possible decision functions. There are several reasons for this:

1. The decision function is selected from a small class of candidates, so this selection can be made easily

2. The decision function is easy to represent (only the index i needs to be specified)

3. The test can be performed quickly (only a single bit needs to be inspected)

4. There is only one degree of freedom in the selection mechanism; thus, a decision function can be selected even if only a small number of samples is available.
Unfortunately, the algorithm is sub-optimal, as it must be since the problem of constructing an optimal decision tree is in general NP-complete [Hyaf76]. Furthermore, there appear to be no tight bounds on the performance of this heuristic under realistic conditions, although [Case81] gives an upper bound on the performance under certain conditions. Considering the enormous cost of constructing an optimal decision tree, this algorithm appears to provide a good compromise.
4.4
DEFINING THE TREE FRINGE
In order to obtain a useful decision tree, it is necessary to limit the growth of the tree. We will do this by means of a number of termination tests, which together specify the tree fringe: if a termination test against a node fails, the node is not extended. In general, we will use several termination tests simultaneously; a node is extended only if it passes all the termination tests. The following is a list of the termination tests that we have implemented to date. Tests that are always used are listed first; optional and experimental tests follow.
1. If all the samples at a particular node have the same label L, the entropy of L is zero and no question can reduce it any further. Thus, there is no need to extend the node.
2. If all the samples at a particular node have identical bit vectors B[1..n], the samples are indistinguishable. Therefore, no question will break up the sample data and reduce the entropy of L. Thus, there is no way of extending the node.

   This situation represents an ambiguity in the bit vectors, which may occur for one of two reasons:

   a. the word in which the context occurs has multiple pronunciations (such as READ /r ee d/, /r eh d/)

   b. the same context occurs in several different words, in which it is pronounced differently. For example: NATION, NATIONAL (/n ei sh uh n/, /n ae sh uh n uh l/)

3. If the product of entropy and node probability falls below a certain threshold, the node is not extended.

   To the extent that the entropy of L at a node is a good approximation of the entropy reduction that can be obtained by extending the node, this test defines a tree fringe corresponding to a local optimum in the trade-off between tree size and expected average classification accuracy.
4. If the number of data samples that corresponds to a node is below a certain threshold, the node is not extended.

   The purpose of this termination test is twofold: to avoid selecting a decision function on the basis of insufficient sample data, and to place an upper bound on the size of the tree.
5. If the label distributions at the immediate descendants of a node would not be different with a statistical significance level of at least, say, 99.95%, the node is not extended.

   The purpose of this termination test is to avoid extending a node if no question can separate the data at that node into two distinct populations. A Chi-squared contingency test is used (α < 0.05%). The significance level of the test must be relatively high, since the separation of the distributions corresponding to the descendants is already the best out of n, where n is the number of bits in a bit vector.
6. Each of the sample vectors is extended ('padded') with a vector consisting of random bits, each of which is '1' with a probability of 0.5. Then, the mutual information between each bit (including the random bits) and the labels L is computed. If the best 'real' bit does not provide more information about L than each of the random bits, the node is not extended.

   The purpose of this termination test is the same as that of the Chi-squared test. While we could simply set a threshold on the minimal mutual information required for extending a node, this threshold should in practice depend on the sample size at the node and the distribution of labels L. The present scheme requires only a single parameter to be specified: the number of random bits, m. There is a trade-off between the selectivity of the test and execution time, since the mutual information of each of the m random bits must be computed. In practice, we have used values of m=20 to m=n.
At each node, the product of entropy and node probability is computed.
This quantity is presumed to decline along the path from the root to a
leaf.
path
If the decline in this value over the last few nodes along a
does
not
exceed
a
certain ,threshold
value,
the path is not
extended.
This termination test imposes a 'progress
requirement'.
The product
of entropy and probability is used because the entropy itself does not
decline
along all paths:
the BQF algorithm sometimes
selects tests
that divide the data for a node into a large class with low entropy
and a small class with an even higher entropy than the original data.
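By way of illustration, the sketch below (Python; the threshold values and argument names are arbitrary, not the values used in the experiments) combines tests 1, 3, 4 and 6 into a single check that decides whether a node is extended.

    import math

    def should_extend(labels, node_probability, best_real_mi, random_mis,
                      min_samples=50, min_entropy_times_prob=1e-4):
        # labels: the labels of the samples at the node.
        # best_real_mi: MI(L; B[i]) of the best real bit; random_mis: the MI
        # values of the m random padding bits (test 6).
        if len(set(labels)) <= 1:              # test 1: only one label remains
            return False
        if len(labels) < min_samples:          # test 4: too few samples
            return False
        counts = {}
        for label in labels:                   # test 3: entropy x node probability
            counts[label] = counts.get(label, 0) + 1
        total = len(labels)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        if h * node_probability < min_entropy_times_prob:
            return False
        if random_mis and best_real_mi <= max(random_mis):
            return False                       # test 6: beaten by a random bit
        return True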
4.5  AN EFFICIENT IMPLEMENTATION OF THE BQF ALGORITHM
Processing Order
The BQF heuristic can be applied to a node only if the sample data for the node is known. Therefore, the nodes must be processed in some sort of top-down fashion. Three different processing orders appear suitable for this task: depth-first order, some type of best-first order and breadth-first order. We have considered and implemented both best-first and breadth-first order.
The original implementation (by R. Mercer) extended nodes in best-first order. The objective of the evaluation function was to identify the node and the test at that node that would constitute the best possible potential refinement of the tree. Since it is not practical to perform this computation, the node with the highest product of entropy (of L) and probability (of the sample data at the node) was extended. This product is an upper bound on the improvement in classification accuracy that can be obtained by extending the node.

Unfortunately, this program made inefficient use of access to external storage since it required periodic scans of the 800,000 labeled sample vectors to find the sample data corresponding to the nodes that were to be extended. For efficiency, these accesses were performed in batches: on each scan, the data for the best n nodes was read, where n was limited to 40 by the amount of available memory. Nevertheless, the number of scans was still proportional to the number of nodes in the tree. This made it very expensive to construct large trees.
In order to utilize each scan of the external file better, the program was re-written to process nodes in breadth-first order. This processing order implies that all the nodes at a particular level are processed sequentially. Every sample vector corresponds to at most one node at any given level (or to none, if it corresponds to a leaf at a higher level). Thus, the nodes at each level define a partition of the sample data. Therefore, the sample data in the external file can be ordered so that it can be read in a single sequential scan as the nodes at a given level are processed. (The data in a file so ordered is said to be aligned with the nodes at that level in the tree). Since the partition at each level is a refinement of the partition at the previous (higher) level, it is possible to re-order any aligned file to be aligned with the next level in the tree in O(n) time and space, as will be shown later.
When proceeding in this fashion, the number of scans of the sample data file is proportional to the depth of the eventual tree, rather than to the number of nodes in the tree. Thus, the number of times that the sample data file has to be read has been reduced from (the number of nodes in the tree) / 40, or between 100 and 7,500, to (the number of levels in the tree) x 3, or between 60 and 90. This improvement is significant, especially for large trees.
Details of Operation
The new BQF program builds the tree one level at a time, starting at the root node. The nodes at each level are extended in left-to-right order. I will first describe how the nodes at a given level can be processed sequentially, given that the input file is properly ordered; I will then show how the sample data can be efficiently re-ordered, so that it will be aligned with the nodes at the next level.
The sample data for each node consists of a set of labeled bit-vectors, {L, B[1..n]}. The decision function for a node is defined by the integer i that maximizes

MI(L ; B[i])  or  H(L) - H(L | B[i]).

The quantity H(L) does not depend on i, and may be ignored. The conditional entropy of L given B[i] may be computed as follows:

H(L | B[i]) = - Σ_L Σ_B[i] ( p(L, B[i]) x log2 p(L | B[i]) )

where Σ_L denotes a summation over all label values L, and Σ_B[i] a summation over the two values of B[i]. The probability distribution p(L | B[i]) can be computed from the observed frequencies of {L, B[i]} and B[i] in the sample data corresponding to the node. Indeed, the sample data for the node is all that is required to perform this computation. Thus, assuming the data in the input file is aligned with the nodes at the current level in the tree, the nodes at that level can be processed one by one, and this takes no more memory than is used to process a single node since the storage space can be reused.
While the nodes at a given level are being processed, it is possible to copy the sample data to a new file, such that the data in the new file will be aligned with the nodes at the next level in the tree. This is done as follows: Before processing of a level begins, an empty file is allocated. Then, the nodes at that level are processed one by one. After a node has been processed (and its decision function i has been determined), the portion of the input file that corresponds to the node is scanned twice. On the first scan, the samples for the left subtree of the node (the samples for which B[i] = '0') are copied into the output file; on the second scan, the samples for the right subtree (for which B[i] = '1') are copied into the output file. Since B[i] must be either '0' or '1', each sample is copied to the output file exactly once. It is clear that the data for the node is now properly ordered for the next level: all the data for the left descendant of the node comes before the data for the right descendant.

The two scans of the input file needed to copy the data can be accomplished by maintaining two additional pointers into the input file. If an efficient 'seek' operation is available, it may be used instead to reposition a single file pointer as needed. When all the nodes at the level have been processed, the output file becomes the input file for the next pass.
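An in-memory sketch of this re-ordering step (Python; lists stand in for the external input and output files, and the names are mine) looks as follows: for each node at the current level, in left-to-right order, the samples with B[i] = 0 are written out before the samples with B[i] = 1.

    def reorder_for_next_level(samples_by_node, decision_bit_by_node):
        # samples_by_node: one list of (label, bits) samples per node at this
        # level, in left-to-right order; decision_bit_by_node: the chosen bit
        # index i for each of those nodes.
        output = []
        for node_samples, i in zip(samples_by_node, decision_bit_by_node):
            # 'first scan': samples routed to the left descendant (B[i] == 0)
            output.extend(s for s in node_samples if s[1][i] == 0)
            # 'second scan': samples routed to the right descendant (B[i] == 1)
            output.extend(s for s in node_samples if s[1][i] == 1)
        return output    # aligned with the nodes at the next level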
If a decision is made not to extend a particular node, two options are available:

1. The data corresponding to the node is copied to the output file. Since the node has no descendants, the data does not correspond to any nodes at subsequent levels, and a provision must be made to skip over the data when it is encountered (and to copy it to the output file once again). This approach ensures that the output file always contains all the samples originally contained in the input file. Thus, when the tree is complete, the data will be aligned with the fringe of the tree, and can be used for various types of postprocessing.

2. The data corresponding to the node is not copied to the output file at all. Since the node has no descendants, this also preserves the alignment between the data in the output file and the nodes at the next level in the tree. This approach has the advantage that the data file always contains only the data samples that are still needed. When the tree is nearly complete and only a small number of samples are still used to refine the remaining 'active' nodes, the savings obtained by this method can be substantial (about a factor of two, I suspect).

An outline of the program is shown in "Overview of the Decision Tree Design Program" on page 108.
Program Complexity
The asymptotic running time of the program consists of two components:

1. The input file is scanned for each level in the tree. During each scan, the entire file must be read (presuming that the data for terminated nodes is retained). For each sample read in, it is necessary to determine whether it belongs to the current node. This is done simply by classifying it and determining whether it ends up at the current node, which takes an amount of time proportional to the depth of the tree at that point. Furthermore, the frequency count for {B[i], L} needs to be incremented for all values of i for which B[i] is '1'. This takes an amount of time proportional to the length of the bit vector.

2. For each node, the mutual information between L and each of the bits B[i] must be computed. This computation requires that the array containing the joint frequency distribution of L and B[i] be read, for all i.

In all, then, the running time is of order

T = O( dN(d+n) + mnL ),

where

   d   is the depth of the tree
   N   is the number of sample pairs {L, B[1..n]}
   n   is the number of bits in each bit vector
   m   is the number of (internal) nodes in the tree
   L   is the size of the output alphabet.

In practice, the quadratic dependence on d (the d²N term) is insignificant in comparison with the remainder of the computation, especially as long as d<n. In practice, then,

T = O( dNn + ε·d²N + mnL ) ≈ O( dNn + mnL ).
The amount of space (fast random access storage) required by the program also consists of two components:

1. The joint frequency f(L, B[i]) is stored for the current node, for all values of L and i;

2. A copy of the growing tree is stored.

The asymptotic memory requirement is thus

S = O( Ln + m ).

If it is necessary to preserve the label distributions corresponding to the nodes, they can be written to an external file as they become available. Finally, the program requires an amount of temporary external storage sufficient to store an extra copy of the sample data file; this can be sequential access storage, and therefore is not included in the space calculation above.
The program has successfully been applied to a set of 800,000 samples, each expressed in terms of 80 features, with an output alphabet of 250 pronunciations, to construct a tree of depth 30 with 15,000 rules. The construction of a tree of depth 23 with 1508 rules, from 31550 samples, each expressed in terms of 80 features, with an output alphabet of 157 pronunciations, required less than six minutes of CPU time on an IBM Model 3033 processor. It is likely that some type of data reduction, applied before the data is presented to the tree design program, could reduce the computation time considerably.
4.6
ANALYSIS OF DECISION TREES
The ultimate objective of our decision tree is simply to minimize the error rate of the Channel Model. The extent to which this objective is achieved can be measured only by performing prediction experiments. However, each decision tree has a variety of characteristics that may be measured, and that give a rough indication of the quality of the tree. Thus, we can evaluate different methods of constructing decision trees without actually performing costly experiments.

Analysis of decision trees is of practical importance for a second reason: we can obtain information about the features in the feature set. In fact, this is our first opportunity to investigate the interaction between all the individual features in the feature set.

Finally, it is possible to determine the effectiveness of the various termination tests by comparing trees constructed with different thresholds on various parameters (such as the minimum number of samples per node, or the significance level for which to test distribution separations).
The following diagnostic information was obtained about the decision trees generated by the BQF program:

1. the number of nodes in the tree (this is an indication of the amount of storage required to store the tree)

2. the average entropy at the leaves, weighted by the leaf probability (this is an indication of the classification accuracy of the tree)

3. the average depth of the leaves, weighted by the leaf probability (this quantity is proportional to average classification time)

4. the depth of the tree (this is an upper bound on classification time, and also an indication of the cost of constructing the tree)

The following information tells us about the individual questions as well as about the tree itself:

5. the average number of times that each question will be used when classifying a vector (this number ranges from 0 to 1, and is an indication of what questions are found by the BQF program to be informative)

6. the average depth at which each question is asked, if at all (this tends to be a small number for good questions, because good questions are asked first)

The following data tells us about the effectiveness of the various termination tests used:

7. the number and aggregate probability of the paths that were terminated by each of the termination tests (if the number of paths terminated by a given termination test is small, that test can be eliminated)

8. the number of leaves at each depth, and their aggregate probabilities (this is an indication of how well balanced the tree is. If the number and probability of the leaves at the lowest levels is small, construction of the tree may have been unnecessarily costly).
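Diagnostics 2 and 3 amount to simple weighted averages over the leaves; a minimal sketch (Python, with an assumed data layout for the leaf records) is:

    def weighted_leaf_statistics(leaves):
        # leaves: list of (probability, entropy, depth) triples for a finished tree.
        avg_entropy = sum(p * h for p, h, _ in leaves)   # diagnostic 2
        avg_depth = sum(p * d for p, _, d in leaves)     # diagnostic 3
        return avg_entropy, avg_depth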
As an example diagnostic, Figure 11 lists the average number of questions asked about each of the letters and phones in a context (for a typical tree). As expected, the average rule contains many more tests of the current letter and other nearby letters and phones than of far-away letters and phones. In this example, the average total number of tests is 10.29.

While the BQF heuristic itself has no parameters that can be 'tuned', we have performed experiments to measure the effect of various termination tests and of the corresponding threshold values. Therefore, a second method for evaluation was used: first, a tree was built using the smallest practical set of termination criteria, with the most tolerant thresholds. Significant parameters, such as the entropy and label distribution at each node, were saved with the tree. Then, alternative termination tests were used to define subsets of this existing tree, the attributes of which (such as number of nodes, and average leaf entropy) could be evaluated without actually constructing any new trees. Also, for each node in the tree, the label distributions at the leaves that descended from this node were compared, and the significance level at which these distributions appeared to be different was reported. This procedure was especially effective in the investigation of the effect of threshold values on tree size and classification accuracy. In particular, binary search on a given parameter has been used to define a subtree of a particular size for an application that was subject to memory constraints.
Letter at     Number of       Phone at      Number of
Offset        Questions       Offset        Questions

  -4            0.097
  -3            0.107            -3            0.064
  -2            0.536            -2            0.180
  -1            0.982            -1            0.840
   0            4.199
  +1            1.723
  +2            0.955 (*)
  +3            0.337
  +4            0.265

Figure 11.  Number of Questions asked about Letters and Phones:  This
            table shows the number of questions asked about a typical
            letter (or phone) at a given offset away from the current
            letter in a typical tree.  For example, an average of
            0.955 questions are asked about the second letter to the
            right of the current letter (*).
4.7  DETERMINING LEAF DISTRIBUTIONS
When the structure of the tree is completely determined, a probability distribution (over the pronunciation alphabet) has to be assigned to each leaf of the tree. Since these probabilities are the output of the Channel Model, it is important that they be estimated accurately. Unfortunately, a simple maximum-likelihood estimate based on the training data would be a rather poor choice, for the following reasons:

*  The number of samples for each leaf is small relative to the size of the pronunciation alphabet. Therefore, the probability of infrequently occurring pronunciations cannot be estimated reliably.

*  The decision tree is tailored to the sample data; therefore, the distributions at the leaves are severely biased and give rise to an overly confident estimate of classification accuracy.

These problems can be alleviated by shrinking the tree, and using the distributions found at nodes that have more samples and are less tailored to the sample data. However, this cannot be taken to its extreme, since the distribution at the root is simply an estimate of the first-order distribution of pronunciations, and does not take the context into account at all.
In order to avoid the problems associated with each of these two extreme methods, we will estimate the actual distribution at each leaf as a linear combination (weighted sum) of the distributions encountered along the path from the root to the leaf. The coefficients of this linear combination (the weights) will be computed by means of deleted estimation [Jeli80], as explained below.

Since not all leaves are at the same depth, the number of nodes along the path from a leaf to the root may vary. In order to simplify the computation of the weights, it is convenient to combine a fixed number (say n) of distributions at each leaf. A typical value for n is 5. The distributions that are used are selected as follows: consider the nodes along the path from a leaf to the root. Associated with each node is a sample size. Call the sample size at the leaf S[1] and the sample size at the root S[n]. Calculate the n-2 numbers that are logarithmically spaced between S[1] and S[n], and call them S[2] through S[n-1]. Along the path from the leaf to the root, find for each of the S[i] the lowest node that has at least S[i] samples. The distributions found at these nodes will be combined to estimate the 'proper' leaf distribution. It can be seen that the distributions at the leaf and at the root are used as the first and the n'th distribution, respectively.
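The selection of the n distributions can be sketched as follows (illustrative Python; the node attribute name is an assumption of this example, not the thesis's data layout).

    def select_estimator_nodes(path_from_leaf_to_root, n=5):
        # path_from_leaf_to_root: nodes from the leaf (first) to the root (last),
        # each carrying a .sample_size attribute.
        s1 = path_from_leaf_to_root[0].sample_size      # S[1], at the leaf
        sn = path_from_leaf_to_root[-1].sample_size     # S[n], at the root
        # S[2]..S[n-1]: logarithmically spaced between S[1] and S[n]
        targets = [s1 * (sn / s1) ** (k / (n - 1)) for k in range(1, n - 1)]
        chosen = [path_from_leaf_to_root[0]]            # S[1]: the leaf itself
        for target in targets:
            # the lowest node (closest to the leaf) with at least S[i] samples
            chosen.append(next(node for node in path_from_leaf_to_root
                               if node.sample_size >= target))
        chosen.append(path_from_leaf_to_root[-1])       # S[n]: the root
        return chosen    # leaf first, root last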
We now have to compute the optimal coefficients or weights

X[1..n](L)

for every leaf L, where n is the number of distributions that are combined. Because we do not have sufficient sample data to accurately estimate the optimal coefficient set for each leaf L, we divide the leaves into classes, and compute a set of weights

X[1..n](C)

for each such class C. The division of leaves into classes should be such that leaves with similar optimal coefficient sets are assigned to the same class, and such that there is enough data for each class to estimate these coefficients reliably. It was conjectured that leaves with similar sample sizes would have similar optimal coefficient sets. Therefore, the range of sample sizes was divided into intervals, and leaves with sample sizes in the same interval were assigned to the same class.
Upon further examination of the data, it was found that the coefficient values also vary rather consistently as a function of the entropy at the corresponding leaf: the higher the entropy, the less reliable the leaf distribution is for a given sample size. This may be the case because there are more significant parameters to be estimated in a distribution with higher entropy. Thus, the classes that exhibited a large variation in leaf entropy (the classes for small sample sizes) were further subdivided into entropy classes. Care was taken to ensure that there was enough sample data for each resulting class. For a tree with 5,000 leaves, we typically used about 15 classes.
The problem is now to determine, for each class, the coefficient vector that minimizes the entropy (and thus maximizes the likelihood) of 'actual' data. Unfortunately, the distribution of the 'actual' data is not known - all we have is sample data, and the sample data yields a poor estimate of the leaf distributions for the reasons mentioned above. In fact, if we were to use the sample data to estimate the coefficients directly, the optimal weight for the leaf distributions would always be one and all other weights would always be zero. Therefore, we will estimate the optimal coefficient values by means of deleted estimation, using an algorithm similar to that presented in [Jeli80]. We first divide the sample data into m batches. A typical value for m is 4. We then construct m trees, where the j'th tree is built from the sample data in batches 1..j-1, j+1..m. For each tree, the data in the 'held-out' batch (batch j for the j'th tree) can be treated as 'actual' data. For each class C, we now compute the coefficient vector X[1..n] that maximizes the probability of the held-out data (all m batches), where the probability of the data in the j'th batch is computed with the distributions found in the j'th tree.
Within each class, the coefficients are computed by an iterative, rapidly converging procedure which is a simplified form of the Forward-Backward algorithm [Baum72]. This algorithm operates by repeatedly re-estimating the coefficient values given the old values, the sample data and the estimators (the distributions in the tree):

X[i]' = (1/N) Σ_x ( X[i] P_i(x) / Σ_j X[j] P_j(x) )

where X[i]' is the new estimate of X[i], Σ_x denotes the sum over all samples x in the (held-out) sample data and N is the number of such samples, P_i(x) denotes the probability of a sample x according to the i'th estimator for the leaf to which x is classified, and Σ_j denotes the sum over all estimators. Of course, the samples x from batch j are classified with the j'th tree, relative to which they are 'new' samples.
As expected, different classes give rise to different coefficient vectors. Figure 12 depicts the coefficient sets that were obtained for a typical tree. Because of space limitations, the coefficients of only three of the fifteen classes are shown. The coefficient values after the first, tenth and thirtieth iterations are shown to illustrate the convergence properties of the algorithm. All coefficients were initialized to 0.2. Class 1 contains all the leaves with a sample size of less than 63 and an entropy of less than 0.65; class 5 contains all the leaves with a sample size of less than 63 and an entropy of more than 1.49; class 15 contains all leaves with a sample size greater than 541. The threshold values that define class boundaries were chosen automatically.
The Forward-Backward algorithm has the property that the probability of the held-out data according to the model,

Π_x Σ_i X[i] P_i(x),

increases at each iteration until a stationary point is reached. For this particular special case, the algorithm is guaranteed to find the global optimum provided that the initial estimates of X[1..n] are nonzero.(15) In practice, fifteen to thirty iterations are sufficient.

(15) L. R. Bahl, personal communication
                           After One      After Ten      After Thirty
                           Iteration      Iterations     Iterations

Class 1:       X[1]         0.47368        0.92513        0.92854
N < 63,        X[2]         0.21041        0.04491        0.04220
H < 0.65       X[3]         0.17609        0.02479        0.02329
               X[4]         0.09715        0.00413        0.00487
               X[5]         0.04268        0.00104        0.00111

Class 5:       X[1]         0.37938        0.84593        0.88184
N < 63,        X[2]         0.23135        0.09329        0.07105
H > 1.49       X[3]         0.20250        0.05065        0.03528
               X[4]         0.13431        0.00946        0.01084
               X[5]         0.05246        0.00067        0.00098

Class 15:      X[1]         0.44915        0.99101        0.99910
N > 541        X[2]         0.25744        0.00863        0.00050
               X[3]         0.15930        0.00029        0.00022
               X[4]         0.07414        0.00003        0.00006
               X[5]         0.05996        0.00004        0.00011

Figure 12.  Coefficient Sets and their Convergence:  This figure
            illustrates the convergence of the coefficient values for
            three representative classes.  In each class, X[1] is the
            weight for the leaf distribution and X[5] is the weight
            for the root (or first-order) distribution.
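The re-estimation step given above can be sketched as follows (illustrative Python; the normalization over the held-out samples follows the standard deleted-estimation update, and all names and the toy numbers are assumptions of this example).

    def reestimate_weights(weights, estimator_probs):
        # weights: the current X[1..n] for one class of leaves.
        # estimator_probs: one row per held-out sample x, giving P_i(x) for each
        # of the n estimators at the leaf to which x is classified.
        n = len(weights)
        new = [0.0] * n
        for probs in estimator_probs:
            denom = sum(w * p for w, p in zip(weights, probs))
            for i in range(n):
                new[i] += weights[i] * probs[i] / denom
        return [v / len(estimator_probs) for v in new]   # normalize over samples

    # Fifteen to thirty iterations, starting from uniform weights of 0.2:
    weights = [0.2] * 5
    held_out = [[0.50, 0.30, 0.10, 0.06, 0.04],     # made-up P_i(x) values
                [0.05, 0.10, 0.20, 0.30, 0.35]]
    for _ in range(30):
        weights = reestimate_weights(weights, held_out)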
5.0
USING THE CHANNEL MODEL
5.1  A SIMPLIFIED INTERFACE
The Channel Model models the mapping between words and their base forms. A word is viewed as a letter string, and a base form is viewed as a phone string. Internally, however, the Channel Model is implemented not in terms of phone strings but pronunciation strings, where each pronunciation corresponds to a single letter, and may consist of zero or more phones. As pointed out before, the difference between the two representations is that a single phone string can correspond to many different pronunciation strings, one for each possible alignment between the letters in the word and the phones in the base form.
Presently, the Channel Model is used for two applications.
Its main func-
tion is to operate in concurrence with the Linguistic Decoder, which finds
the most likely base form for a word, given the spelling of the word and a
single sample of its pronunciation. As a secondary function, the Channel
Model is part of a very simple base form predictor, which attempts to predict the base form of a word on the basis of its spelling alone. This simple predictor is used mostly to test the Channel Model.
As a third application, as yet unrealized,
the Channel Model could be used
in an unlimited-vocabulary speech recognition system: a 'front end' would
produce a phone sequence X,
and a decoder would find the most likely let-
ter string L given X using the a priori probability of L and the Channel
Model.
The application programs that use the Channel Model are interested in finding base forms, not the various pronunciation strings to which these may correspond. Therefore, an interface has been defined that allows these application programs to operate in terms of phone strings only.
5.1.1  PATHS
The application programs perform their search for the 'correct' base form in a left to right manner: they start with the null base form, and consider successively longer and longer partial solutions (and finally complete solutions) until the best possible complete solution has been identified. Each partial solution is a leading substring or prefix of a set of complete solutions. Such a partial base form is called a path. A path is characterized by a phone string and the probability that the 'correct' base form begins with that phone string. If the model were very accurate, the probability of all paths that are leading substrings of the correct base form would be 1.0, and the probability of all other paths would be 0.0. In practice, of course, there will be many other paths with a nonzero probability.

The successively longer partial solutions and, eventually, the complete paths are obtained by extending shorter solutions. The operation of extending a path (by a particular phone) yields a new path, the phone string of which consists of the phone string of the original path extended by the new phone. The probability of the new path is the probability that the new path is a prefix of the correct solution. Clearly, this must be less than or equal to the probability of the original path, with equality only if the new phone is the only phone by which the original path can be extended. In order to simplify the treatment of complete paths, we have found it convenient to adopt the following convention: each word is followed by a distinguished letter that can only have the pronunciation '#', and a path is complete if and only if its last symbol is the termination marker '#'. (For this purpose, '#' is considered part of the phone alphabet.) The sum of the probabilities of all different extensions of a path (including the extension by '#') is always equal to the probability of the path itself.
The interface to the Channel Model is defined in terms of two operators on
paths: one operator creates a null path, given a letter string; the other
operator extends a path by a given phone or by the termination marker,
yielding a new path. The implementation of the operator which creates the
null path is trivial; the implementation of the operator which extends a
path is as follows.
5.1.2
SUBPATHS
The Decision Tree module which underlies the Channel Model implementation operates in terms of pronunciation strings rather than phone strings. Internally, any path may correspond to several different pronunciation strings. Each such pronunciation string is represented internally as a subpath. A complication arises because paths are extended in increments of one phone, while pronunciation strings are extended in increments of one pronunciation. To straighten matters out, we have defined subpaths so that they, too, are extended in increments of one phone. A subpath consists of a pronunciation string, and an indication of how much of the pronunciation string is indeed part of the corresponding path. The reason why this indication is necessary can be illustrated as follows: consider the word 'EXAMPLE' with base form /i g z ae m p uh l/,
and the partial base form (or path) /i/. This path has a single subpath, containing the pronunciation /i/. Now, consider what happens when we extend this path by the phone /g/. Internally, the pronunciation of the 'X' is determined to be /g z/. Thus, the subpath can be extended to

/i/-/g z/

At this point, the pronunciation string of the subpath contains more phones than are desired for the subpath itself (which should have a phone string of /i g/, not /i g z/). Thus, the subpath specification is augmented by a number that indicates how many of the phones in the last 'link' of the subpath are indeed part of the corresponding path:

/i/-/g z/1.
Continuing the example, suppose that we now wish to extend the new path, /i g/, by the phone /z/. In this case, it suffices to update the indicator of how many phones are part of the path. The new subpath looks like this:

/i/-/g z/2.
A further complication to this scheme is posed by the fact that a pronunciation may not contain any phones at all. This complication can be handled in an entirely consistent manner, as follows: consider the word 'AARDVARK' with base form /a r d v a r k/. Assume, for the moment, that the first /a/ is aligned with the first 'A', not with the second. Thus, the only subpath of the path /a/ looks like this:

/a/1.
Now, consider what happens when we extend the path /a/ by the phone /r/. Internally, the pronunciation of the next letter of the word (the second 'A') is predicted to be // and the following subpath is created:

/a/-//.

Obviously, this cannot suffice, since this subpath does not contain the phone string /a r/. Thus, the pronunciation of the third letter must be predicted, yielding the subpath

/a/-//-/r/1.

In general, whenever the pronunciation of a letter can be null, the pronunciation of the next letter must be predicted as well. If it too can be null, the procedure is repeated. This process continues until the end of the word is reached, or until a letter is encountered that cannot have a null pronunciation.
Finally, I will illustrate with two examples the mechanism by which multiple subpaths for the same path come into existence. Although both are examples of the same general case, they are sufficiently different that separate illustrations are in order. The first example is rather trivial: suppose that the first /a/ in 'AARDVARK' can align with either the first or the second 'A'. If this is the case, then the path /a/ has two subpaths:

/a/1 and //-/a/1.

When this path is extended by the phone /r/, each of its subpaths must be extended, and two new subpaths result:

/a/-//-/r/1 and //-/a/-/r/1.

Thus, multiple subpaths may arise whenever there is an ambiguity in the alignment.
While the above example revolves around the null pronunciation, alignment ambiguities arise whenever two different pronunciations can provide the phone by which the path is extended. Consider, for example, the word 'BE' with base form /b ee/. Since the pronunciation of the 'B' can be either /b/ or /b ee/, the path /b/ is represented by the following two subpaths:

/b/1 and /b ee/1.

When this path is next extended by the phone /ee/, the following two subpaths may be formed:

/b/-/ee/1 and /b ee/2.

When, finally, this path is extended by the termination marker '#', the following two subpaths are formed:

/b/-/ee/-# and /b ee/-//-#.
The first subpath represents the 'correct' alignment, in which the 'B' corresponds to the /b/ and the 'E' to the /ee/. The second subpath represents an 'incorrect' alignment, in which the 'B' corresponds to /b ee/ and the 'E' is silent.
When a path is complete, one subpath is usually much more likely than all other subpaths, due to the consistency of the aligned training data. Nevertheless, it is necessary to keep track of all plausible alignments from the beginning, since the best alignment often cannot be immediately identified.
5.1.3
SUMMARY OF PATH EXTENSION
In order to extend a path, it is necessary to extend all of its subpaths. The probability of the new path can then be computed as the sum of the probabilities of the new subpaths. To extend a subpath, proceed as follows:

1. If not all the phones in the last link of the subpath are 'used', examine the first 'unused' phone:

   a. If it is identical to the phone by which the subpath is to be extended, simply make a copy of the subpath and update the length of the new subpath. The probability of the new subpath is the same as that of the old subpath.

   b. If it is different, the subpath cannot be extended by the given phone. Thus, the probability of the extension is zero.

2. If the pronunciation string of the subpath contains no extra phones, the decision tree module must be invoked to predict the pronunciation of the next letter. The input to the decision tree module is a context, which can be formed from the spelling of the word and the pronunciation string stored in the subpath. The position of the current letter in the letter string can be determined by counting the number of links in the subpath.

   Note that if there are no more letters, the probability of the termination marker '#' is one and of all other pronunciations, zero. Otherwise, several pronunciations that begin with the given phone may have nonzero probability (for example, /b/ and /b ee/). For each such pronunciation, a new subpath is created by attaching this new pronunciation to a copy of the original subpath and indicating that only the first phone of the new link is 'used'. The probability of each new subpath is equal to the probability of the original subpath times the probability of the new pronunciation given the context.

   If the null pronunciation has a nonzero probability, a new subpath is made by appending '//' to a copy of the original subpath. This new subpath is then extended recursively by invoking step 2 above.

The subpaths generated are collected, and the probability of the extended path can now be calculated: it is the sum of the probabilities of the individual subpaths.
Since each subpath corresponds to a particular alignment between the letter string and the phone string, the system can also be used to determine
the most likely alignment
between a given (letter string, phone string)
pair. This capability is not used at present.
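The extension procedure summarized above can be sketched in Python as follows. This is an illustrative reconstruction rather than the original code; the Subpath representation and the predict callback (standing in for the decision tree module) are assumptions.

from dataclasses import dataclass
from typing import List, Tuple

Pron = Tuple[str, ...]          # a pronunciation: zero or more phones

@dataclass(frozen=True)
class Subpath:
    links: Tuple[Pron, ...]     # one pronunciation per letter consumed so far
    used: int                   # phones of the last link that belong to the path
    prob: float

def extend_subpath(sub: Subpath, phone: str, predict) -> List[Subpath]:
    # Extend one subpath by one phone, following steps 1 and 2 of the summary.
    # predict(position, links) stands in for the decision tree module: it
    # returns a dict mapping pronunciations to probabilities for the letter at
    # 'position', which equals the number of links already present.
    last = sub.links[-1] if sub.links else ()
    unused = last[sub.used:]

    # Step 1: the last link still has 'unused' phones.
    if unused:
        if unused[0] == phone:
            return [Subpath(sub.links, sub.used + 1, sub.prob)]
        return []                                 # probability zero

    # Step 2: predict the pronunciation of the next letter.
    results = []
    for pron, p in predict(len(sub.links), sub.links).items():
        if len(pron) == 0:                        # null pronunciation: recurse
            results.extend(extend_subpath(
                Subpath(sub.links + (pron,), 0, sub.prob * p), phone, predict))
        elif pron[0] == phone:                    # only the first phone is 'used'
            results.append(Subpath(sub.links + (pron,), 1, sub.prob * p))
    return results

def extend_all_subpaths(subpaths: List[Subpath], phone: str, predict):
    # Extend every subpath of a path; the probability of the extended path is
    # the sum of the probabilities of the new subpaths.
    new = [s for sub in subpaths for s in extend_subpath(sub, phone, predict)]
    return new, sum(s.prob for s in new)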
5.2
A SIMPLE BASE FORM PREDICTOR

A simple base form predictor, which predicts base forms from letter strings, was developed - mainly to serve as a test vehicle for the Channel Model.
The function of the base form predictor is to identify the base form X
that has the highest likelihood of being the correct base form of a given
letter string, L:
find the X that maximizes p(X | L),

where p(X | L) is computed using the Channel Model. The base form predictor operates as follows:
1. Start with the null path π0.

2. Maintain a list of paths, ordered by f'(π), which is an estimate of the probability of the best complete path that may be formed by extending the path π. The path with the highest value of f'(π) is at the head of the list. (For efficiency, this list or priority queue is implemented as a heap, which allows insertions and deletions to be performed in O(log n) time.)

3. Repeat the following procedure until a complete path is found at the head of the list:

   a. Remove the best path from the head of the list.

   b. Generate all its possible extensions by invoking the Channel Model - there will be one extension for each possible phone, and one for the termination marker '#'.

   c. Compute f'(π) for each extension and insert it into the list at its appropriate place, maintaining the ordering on f'(π).

4. When a complete path is found at the head of the list, the algorithm terminates.
This algorithm, which is known as 'stack decoding', has the important property that as long as f'(π) ≥ f(π) for all paths π, the first complete path that makes its way to the head of the list will be the best solution overall ([Nils80], section 2.4: the A* algorithm).
The efficiency of the stack algorithm depends on how closely f'(π) approximates f(π). In the present implementation, f'(π) is defined as the total probability of all the complete paths that may be formed by extending the path π. Note that the condition f'(π) ≥ f(π) is always satisfied. However, this is a rather poor estimate of f(π), and the search performed by the predictor tends to take time exponential in the length of the base form that is to be predicted. This is not a problem, since even long words can be handled within a second or so, and the program is used for testing only.
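For illustration, the stack search itself can be written quite compactly around a priority queue; the following Python sketch uses the standard heapq module, and the callbacks (make_null_path, extend_path, f_prime, is_complete) are placeholders for the quantities described above rather than the actual program.

import heapq

def stack_decode(letters, phone_set, make_null_path, extend_path, f_prime, is_complete):
    # Generic 'stack decoding' (A*-style) search over paths.  As long as
    # f_prime(path) is an overestimate of the probability of the best complete
    # extension of 'path', the first complete path popped from the heap is the
    # best solution overall.
    start = make_null_path(letters)
    heap = [(-f_prime(start), 0, start)]   # heapq is a min-heap: negate scores
    counter = 1                            # tie-breaker for equal scores
    while heap:
        _, _, path = heapq.heappop(heap)
        if is_complete(path):
            return path
        # One extension per possible phone, plus one for the terminator '#'.
        for phone in list(phone_set) + ['#']:
            new_path = extend_path(path, phone)
            heapq.heappush(heap, (-f_prime(new_path), counter, new_path))
            counter += 1
    return None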
5.3
THE BASE FORM DECODER
The Channel Model has been successfully interfaced to the IBM Linguistic
Decoder [Jeli75] [Jeli76],
producing a system, the Base Form Decoder, that
predicts the base form of a word on the basis of its spelling and a single
sample of its pronunciation.
The Base Form Decoder is a maximum likeli-
hood decoder: its objective is to identify the phone sequence X that has
the highest likelihood of being the correct base form of the word W, given
the spelling of the word (a letter string L) and a sample of the pronunciation of the word (a speech signal or acoustic signal A). In other words, X should maximize p(X | L, A).
Using Bayes' rule, we can rewrite this expression as

p(X | L, A) = p(X | L) x p(A | L, X) / p(A | L)

and make the following observations:

1. The expression p(A | L, X) can be simplified to p(A | X), because the dependence of A on L is through X only.

2. The value of the denominator, p(A | L), does not depend on X. Consequently, it makes no contribution to the maximization and may be ignored.
Thus, we can re-state our objective: X should maximize p(X | L) x p(A | X).

The quantity p(X | L) can be computed by means of the Channel Model; the quantity p(A | X), which is sometimes called the 'acoustic match', can be computed by means of a model of the relation between phone strings and sounds, such as that described in [Jeli76].
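In practice this product is most conveniently evaluated in log space, where it is simply a sum; a minimal illustration (channel_model and acoustic_model are assumed stand-ins for the two models, not actual interfaces):

import math

def combined_log_score(X, L, A, channel_model, acoustic_model):
    # Score a candidate base form X for spelling L and acoustic sample A.
    # Maximizing p(X | L) * p(A | X) is the same as maximizing
    # log p(X | L) + log p(A | X); the denominator p(A | L) is omitted because
    # it does not depend on X.
    return math.log(channel_model.prob(X, L)) + math.log(acoustic_model.prob(A, X))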
The maximization is performed by the Linguistic Decoder by means of a stack decoding mechanism similar to that described in the previous section, but more sophisticated [Bahl75] (p. 408). In particular, it uses a better approximation of f'(π), so that search time is approximately linear in the length of the decoded string rather than exponential.
An experiment has been performed to evaluate the performance of the combined system (Channel Model, Acoustic Matcher and Stack Decoder): the estimated phone error rate is between 2.14% and 3.34%. The results of this experiment are described in more detail in "Performance of the complete system" on page 83 ff.
6.0
EVALUATION
This chapter consists of four sections:

* Performance
* Objectives Achieved
* Limitations of the system
* Suggestions for further research
6.1
PERFORMANCE

6.1.1
PERFORMANCE OF THE CHANNEL MODEL BY ITSELF
Two versions of the Channel Model have been constructed and tested. One model was trained on 4000 words from a 5000-word base form vocabulary (hereafter referred to as the OC vocabulary, for Office Correspondence); the other model was trained on a 70,000-word dictionary. Not unexpectedly, the 4000 OC base forms appeared to be insufficient as training data for the model. The dictionary, on the other hand, appears to contain enough information to properly train the Channel Model.
The model trained on the 4000 OC base forms was used to estimate a weak fuzzy lower bound on the error rate of the Channel Model, simply by having it predict the base forms of the words that it was trained on. For each word, the 'decoded' phone string was aligned with the 'correct' phone string using the string alignment program of [Wagn74], and every mismatch (insertion, deletion or substitution) between the two strings was counted as an error. The number of errors was then divided by the number of phones in the 'correct' string, to yield a 'phone error rate' of about 1.05%, with most of the errors being in stress placement.
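The error count used here is the familiar minimum edit distance between the decoded and the correct phone strings; the following is a small illustrative Python version, not the [Wagn74] program itself.

def phone_error_rate(decoded, correct):
    # Minimum number of insertions, deletions and substitutions needed to turn
    # 'decoded' into 'correct', divided by the number of phones in 'correct'.
    m, n = len(decoded), len(correct)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if decoded[i - 1] == correct[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[m][n] / n

# For example, one substitution in a five-phone string gives a rate of 0.20:
#   phone_error_rate('b i t e #'.split(), 'b ai t e #'.split())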
An upper bound on the error rate of the Channel Model was estimated as follows: the model trained on the dictionary was used to predict the base forms of 500 words from the OC vocabulary, and a 'phone error rate' was calculated as before. Averaged over all the words in the test data, the 'phone error rate' was 21.0%. The 'actual' error rate is less than 21%, for the following reasons:
* The strings compared are in slightly different formats, since the 'correct' strings follow the conventions of the OC vocabulary, whereas the 'decoded' strings attempt to follow the conventions of the dictionary. Typical discrepancies include:

  - /er/ vs. /uh r/,
  - /n t s/ vs. /n s/,
  - /j uu/ vs. /uu/, and
  - unstressed /i/ vs. unstressed /uh/.

* Some phones are represented as two symbols; for example, /ix/ is represented as /ax ixg/. Consequently, a substitution of, for example, the 'I' in 'BITE' for /ax ixg/ is counted as two errors.

* A single word may have multiple base forms; if so, only one will be considered 'correct'.
An examination of the errors made by the Channel Model indicates that some
of these errors could either be avoided or tolerated in an application
environment,
although some errors are rather severe.
For a list of typical errors (every tenth error) see Figure 13.
Context        Correct            Predicted
COVERAGES      /ERO/              /UHO RX/
ELEMENTARY     /ERO/              /UHO RX/
FELIX          /EE1 LX IXO/       /EH1 LX UHO/
ITINERARY      /AX1 IXG TX IX1/   /IXO TX AX1 IXG/
NEGOTIATING    /UHO/              /IXO/
PRODUCING      /Ul/               /UHO/
REWARD         /EEO/              /IXO/
SUSPECT        /UH1/              /UHO/
ACCIDENTS      /IXO/              /UHO/
CALCULATED     /IXO/              /UHO/
CRITIQUE       /IXO TX EE1/       /EE1 SH UHO/
DUPLICATION    /DX UUl/           /DX JX UUl/
GENERATING     /ERO/              /UHO RX/
INQUIRING      /IXl/              /IXO/
LOADED         /IXO/              /UHO/
PACKAGING      /IXO/              /EI1/
READS          /EE1/              /EH1/

Figure 13. Typical Channel Model Errors: Selected errors made by the Channel Model when operating without clues from acoustic information.
6.1.2
PERFORMANCE OF THE PHONE RECOGNIZER BY ITSELF
A version of the Linguistic Decoder for connected speech [Jeli76] was configured to recognize phones. This was done by replacing the standard vocabulary of the system by a vocabulary of 50 phones (stressed and unstressed vowels counted separately). No phonological rules were used.
Training and test data were obtained from an existing corpus of speech, which had been recorded in a noisy (office) environment by a male talker with a close-talking microphone. A total of 3900 words of speech were available, in 'sentences' of ten words each. These words had been drawn from a vocabulary of 2000 different words. The data was segmented automatically into single-word units, and the resulting corpus was split up into 3500 words of training data and 400 words of test data, such that no word occurred in both.
Phone Recognition is a rather difficult task (at least for automatic recognizers), and it would be foolhardy to expect good performance of a system this simple. Indeed, the phone recognition rate of the system was only about 60%. This recognition rate was computed on the basis of a minimum-distance alignment, constrained in that two phones could not be aligned if the portions of the speech signal to which they corresponded did not overlap. Insertions were not counted as errors in this experiment. For a list of typical errors (every tenth error) see Figure 14.
6.1.3
PERFORMANCE OF THE COMPLETE SYSTEM
The choice of test data for the combined system was constrained by the limited availability of recorded speech and sample base form data in compatible formats. The same data used to test the phone recognizer was used to test the combined system. This data does not include any speech that contributed to the acoustic model training, and comes from a source that is independent of the dictionary. Of the 400 words of test data, the first 50 words were used to test the decoder and to adjust various parameters; the next 300 words were used for evaluation; the remaining 50 words were saved for emergencies.
Context          Correct     Predicted
VARIOUS RESOLVE  /XX/        /KX/
RECEIPT          /TX/        /KX/
REQUIREMENT      /NX/
BOTTOM           /AAl/
LEARN MISSION    /XX/
PROCEED          /PX/        /KX/
POSITION         /PX/        /TX/
TRANSFER         /FX/        //
LUCK             /LX/        /WX/
EMPLOYMENT       /LX/        /wX/
FORMAL           /UHO/
HAVE             /AE1/
                             /UHl/
                             /TX XX/
                             /AEl AAl/

Figure 14. Typical Phone Recognizer Errors: Selected errors made by the Phone Recognizer when operating without clues from spelling.
Of the 300 evaluation words, 4 turned out to be abbreviations and were eliminated from further consideration. The remaining 296 words contained a total of 1914 phones, or an average of 6.5 phones per word.
The decision tree used for this experiment consisted of 3424 rules with a
(weighted)
average
length of 10.29 tests.
The longest rule required 21
tests.
In addition to the problem of slight differences in format that also affected our Channel Model test, the combined test is complicated by the fact that results depend on the speaker. In fact, the dictionary allows multiple, equally likely base forms for many words and consequently the system can produce different base forms for different utterances of the same word.

The base forms produced by the combined system (296 in all) were graded by the author, using two different evaluation standards:

* A lower bound on the error rate was established by counting only definite errors, giving the system the 'benefit of the doubt' in questionable cases. This lower bound on the 'phone error rate' was found to be 41/1914 = 2.14%.

* An upper bound on the error rate was established by eliminating the 'benefit of the doubt' and counting all possible errors. This upper bound on the 'phone error rate' was found to be 64/1914 = 3.34%.

The base forms that contained 'errors' by either of these standards are listed in "Evaluation Results" on page 109.

It can be seen that the error rate of the combined system is much less than that of the phone recognizer, and much less than the upper bound on the error rate of the Channel Model operating by itself. Nevertheless, it is clear that the system in its present form is not ready for use 'in the real world'.
6.2
OBJECTIVES ACHIEVED
As the reader will recall, we set out to build a system that would generate
base forms automatically
(i.e.,
without expert intervention).
This
system has been built, out of several quite independent parts:
The Channel Model represents the relation between letter strings and the corresponding phonemic base forms: this model is in the form of a large collection of probabilistic spelling to base form rules. These rules were discovered automatically, by a very simple program, on the basis of an analysis of a suitable amount of sample data. Because of this, the system is self-organized: it can be applied to any collection of sample data, and it will infer an appropriate set of spelling to base form rules from these 'examples'.
A phone recognizer is employed to estimate the probability that a particular phone string would have produced a given observed speech signal. This phone recognizer was constructed by modifying a connected word recognizer, replacing its ordinary vocabulary by a lexicon of phones. Very little knowledge of phonetics or phonology has been applied to the design of the phone recognizer: instead, it is 'trained' on a suitable amount of sample speech by means of the powerful 'Forward-Backward' or 'EM' algorithm.
The knowledge contributed by these two systems is combined as follows. Recall that each system calculates a probability value. If X is a phone string, L a letter string and A an acoustic observation, then

* p(X | L) is the probability that X is the base form of a word with spelling L;

* p(A | X) is the probability that the base form X, when pronounced by the talker, would have produced the speech signal A.

From these two values, we may calculate the probability of X given L and A using Bayes' Rule:

p(X | L, A) = p(X | L) x p(A | L, X) / p(A | L)

The denominator, p(A | L), may be ignored in practice because it depends only on A and L and is therefore constant.
The search for the most likely phone string (by the above measure) is conducted using the 'Stack Algorithm' or A* Algorithm. An existing implementation of this algorithm is used.
The performance figures given in the preceding section are intended as an indication of how well a system of this type can perform in practice. The measured phone error rate of between 2% and 4% is probably too high for most 'real' applications. Nevertheless, it is substantially better than the results obtained if either of the two knowledge sources is eliminated. This confirms our conjecture that the two knowledge sources complement one another.
Furthermore, these figures can be viewed as demonstrating the viability of the design philosophy, namely to rely as much as possible on self-organized methods, and as little as possible on the designer's prejudices regarding the structure of the problem and its solution. While the application of Information Theory to language is not new [Shan51], its application to the formulation of spelling to sound rules and the like appears to be novel. The present approach was first suggested in detail by R. Mercer.
As the project neared completion, several limitations of the design approach became apparent; they are discussed in the next section. The remaining section of this chapter is dedicated to potential improvements of the system.
6.3
LIMITATIONS OF THE SYSTEM
The Channel Model utilizes a rather simple model of the relation between letter strings and the corresponding base forms. This limits the performance of the model in a number of ways:
* Inversions: Sometimes the correspondence between the letters of a word and the phones of its base form is not sequential. Because it is difficult to express a rule involving an inversion in terms of the feature sets employed, these rules may not be discovered at all.

* Transformations: It appears that the present model is unable to capture various transformations that can take place at the phone or letter level, such as the change in the pronunciation of the 'A' when a suffix is added to the word 'NATION':

  'NATION' with base form /n 'ei sh uh n/,
  'NATIONAL' with base form /n 'ae sh uh n uh l/

  Thus, each of these strings has to be modeled separately. (In this example, the decision between /ei/ and /ae/ is made very difficult by the fact that the nearest difference in the letter strings is more than 4 letters away.)
The Channel Model is trained on a limited amount of sample data, and will
therefore have difficulty with types of data that are not represented in
the sample set, such as:
* Special 'Words': Some vocabularies of interest, such as the one for 'Office Correspondence', contain a significant number (say, 3%) of abbreviations and other 'words' to which the ordinary spelling to sound rules do not apply, such as 'CHQ', 'APL' and 'IBMers'. The present system is able to deal with abbreviations and acronyms only to the extent that examples are included in training data.

* Proper Names: We have found that proper names appear to require different spelling to sound rules than ordinary words. Unfortunately, the available dictionary does not contain any proper names on which the Channel Model can be trained.
Finally, the pattern recognition algorithms and statistical procedures (as applied to feature selection, decision tree design and distribution smoothing) have their limitations:

* Complex Patterns: It appears that many spelling to baseform rules are triggered only by rather complex patterns (complex in terms of the available features). Unfortunately, the number of bits available to represent each pattern is limited by the depth of the decision tree, which in turn is limited by the amount of sample data and in particular by the number of similar contexts with different pronunciations. In practice, the decision tree construction program is sometimes unable to identify the entire pattern, and will construct a rule that can be invoked when only a part of the pattern is present.
* 'Global' Constraints: The system appears to have difficulty discovering 'global' constraints (within words), such as the fact that each word should contain at least one stressed syllable. Naturally, this knowledge could be made explicit but we would like to avoid that, since such constraints would be different for each application.
* 'New' Data: The smoothing procedures employed are rather ad hoc; there is no reason to believe that they lead to optimal prediction of 'new' data.
6.4
SUGGESTIONS FOR FURTHER RESEARCH
I believe that with some effort, the system can be improved considerably within the present design framework. The following potential improvements appear particularly promising:
* Add a filter to the system that will recognize acronyms and abbreviations (which is relatively easy, since they are often capitalized), and treat them specially: allow for 'spelled-out' base forms in addition to what the regular Channel Model predicts, and remove the 'spelled-out' base forms from the training data.

* Use a more complete dictionary as training material, preferably one that includes inflected forms, geographical names and surnames, and common abbreviations.
* Use a set of phonological rules to allow the phone recognizer to look for surface forms instead of base forms [Cohe75] [Jeli76].

* Improve the pattern matching stage, perhaps by using prime-event analysis [Stof74] or by building a generalized decision network instead of just a tree. Consider using multi-stage optimization in the tree design process [Hart82].
* Improve the feature selection stage, perhaps by abandoning the limitation that features may apply to individual letters and phones only.

* Use a more refined acoustic/phonetic 'speaker performance model' in the phone recognizer, for example one that takes account of context.
* Reduce the amount of data presented to the decision tree design program by 'factoring' out patterns that are repeated many times; this will reduce the amount of computation required.

* Improve the data smoothing procedure, perhaps on the basis of a more accurate model of classification faults.

* Improve the interface between the Channel Model and the Phone Recognizer with regard to constraint propagation and error propagation, perhaps by means of a meta-level model; allow the degree of speaker-sensitivity to be adjusted.
* Eliminate the need for alignment of the sample data by considering all possible alignments at all times.

* Replace the Channel Model (which computes p(X | L)) by two separate models, one of phone strings and one of the mapping from phone strings to letter strings. The quantity p(X | L) can then be computed as

  p(X) x p(L | X) / p(L),

  where p(L) may be ignored because it does not depend on X. Together, the two separate models can be more powerful than the simple Channel Model.
A.0
REFERENCES
[Alle76] J. Allen, Synthesis of Speech from Unrestricted Text, in Proceedings of the IEEE, Vol. 64, No. 4, pp. 433-442 (1976)

[Bahl75] L. R. Bahl, F. Jelinek, Decoding for Channels with Insertions, Deletions and Substitutions with Applications to Speech Recognition, in IEEE Transactions on Information Theory, Vol. IT-21, No. 4, pp. 404-411 (1975)

[Baum72] L. E. Baum, An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes, in Inequalities, Vol. III, Academic Press, New York, 1972

[Case81] R. G. Casey, G. Nagy, Decision Tree Design Using a Probabilistic Model, IBM Research Report RJ3358 (40314), 1981

[Cohe75] P. S. Cohen, R. L. Mercer, The Phonological Component of an Automatic Speech-Recognition System, in Speech Recognition, Academic Press, New York, 1975, pp. 275-320

[Elas67] J. D. Elashoff, R. M. Elashoff, G. E. Goldman, On the choice of variables in classification problems with dichotomous variables, in Biometrika, Vol. 54, pp. 668-670 (1967)

[Elov76] H. S. Elovitz, R. Johnson, A. McHugh, J. E. Shore, Letter-to-Sound Rules for Automatic Translation of English Text to Phonetics, in IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 6, pp. 446-459 (1976)

[Forn73] G. D. Forney, Jr., The Viterbi Algorithm, in Proceedings of the IEEE, Vol. 61, pp. 268-278 (1973)

[Gall68] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, 1968

[Hart82] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, C. L. Gerberich, Application of Information Theory to the Construction of Efficient Decision Trees, in IEEE Transactions on Information Theory, Vol. IT-28, No. 4, pp. 565-577 (1982)

[Hunn76] S. Hunnicutt, Phonological Rules for a Text-to-Speech System, in American Journal of Computational Linguistics, Microfiche 57 (1976)

[Hyaf76] L. Hyafil, R. Rivest, Constructing Optimal Binary Decision Trees is NP-Complete, in Information Processing Letters, Vol. 5, No. 1, pp. 15-17 (1976)

[Jeli75] F. Jelinek, L. R. Bahl, R. L. Mercer, Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech, in IEEE Transactions on Information Theory, Vol. IT-21, No. 3, May 1975, pp. 250-256

[Jeli76] F. Jelinek, Continuous Speech Recognition by Statistical Methods, in Proceedings of the IEEE, Vol. 64, No. 4, pp. 532-556 (1976)

[Jeli80] F. Jelinek, R. L. Mercer, Interpolated Estimation of Markov Source Parameters from Sparse Data, in Proceedings of the Workshop on Pattern Recognition in Practice, North-Holland Publishing Company, 1980, pp. 381-397

[Meis72] W. S. Meisel, Computer-oriented Approaches to Pattern Recognition, Academic Press, New York, 1972

[Nils80] N. J. Nilsson, Principles of Artificial Intelligence, Tioga Publishing Co., 1980

[Payn77] H. J. Payne, W. S. Meisel, An Algorithm for Constructing Optimal Binary Decision Trees, in IEEE Transactions on Computers, Vol. C-26, No. 9, pp. 905-916 (1977)

[Shan51] C. E. Shannon, Prediction and Entropy of Printed English, in Bell System Technical Journal, Vol. 30, pp. 50-64 (1951)

[Stof74] J. C. Stoffel, A Classifier Design Technique for Discrete Variable Pattern Recognition Problems, in IEEE Transactions on Computers, Vol. C-23, No. 4, pp. 428-441 (1974)

[Tous71] G. T. Toussaint, Note on Optimal Selection of Independent Binary-Valued Features for Pattern Recognition, in IEEE Transactions on Information Theory, September 1971, p. 618

[Wagn74] R. A. Wagner, M. J. Fischer, The String-to-String Correction Problem, in Journal of the Association for Computing Machinery, Vol. 21, No. 1, January 1974, pp. 168-173
B.0
DICTIONARY PRE-PROCESSING
Sample data for the construction of the spelling-to-baseform channel model was obtained from two sources:
1. An on-line dictionary containing some 70,000 entries, and

2. The existing base form collection of the IBM Speech Recognition Research Group.
This chapter explains the need for pre-processing of data from these
sources, and provides some detail regarding the methods used.
Note: The algorithms presented in this section represent work by R. Mercer and the author.

B.1
TYPES OF PRE-PROCESSING NEEDED
The pronunciation information contained in our dictionary was not originally in a convenient form for further processing, for the following
reasons:
1.
The pronunciations given in the dictionary are not always base forms:
often the dictionary presents several forms of the same basic pronunciation that could all be derived from a single base form.
Thus, it is necessary to identify and remove all pronunciation entries
that are variants of some given base form.
This problem was not
addressed at this stage. Given a set of phonological rules [Cohe75],
it is a trivial matter to determine whether two phone strings are variants of each other.
2.
The pronunciations in the dictionary are sometimes incomplete:
in
order to save space, the dictionary gives the pronunciation of each
word that differs only slightly from the preceding word by merely
indicating the difference (see Figure 15).
These incomplete pronunciations are identifiable by a leading and/or a trailing hyphen.
For each incomplete pronunciation, the single correct completion had
to be selected.
This was done automatically, as described in "Base
form Completion" on page 97 ff. .
3.
The particular version of the dictionary that was available to us contains a small but not insignificant number of errors, most of which
appear to have been introduced during some transcription or translation process.
MAG-A-ZINE       /m 'ae g - uh - z .ee n/
                 /m .ae g - uh - '/
MA-TRI-MO-NIAL   /m .ae - t r uh - m 'ou - n ee - uh l/,
                 /- n j uh l/
TOL-ER-ATE       /t 'aa l - uh - r .ei t/
TOL-ER-A-TIVE    /- r .ei t - i v/
TOL-ER-A-TOR     /- r .ei t - uh r/
TO-TA-QUI-NA     /t .aa t - uh - k w 'ai - n uh/,
                 /- k (w) 'ee -/

Figure 15. Partially Specified Pronunciations: In the dictionary, pronunciations are not always completely specified.
A match score was computed for each (spelling, pronunciation) pair and
the dictionary was sorted by this match score. The part of the dictionary that contained the low-scoring words was then examined manually,
and any pronunciations that were found to be in error were
either corrected or eliminated.
4.
The dictionary contains no entries for regular inflected forms; this
affects plurals, present and past tenses and participles, and comparatives and superlatives.
A program was written to inspect an ordinary word frequency list (of
about 65,000 words).
For each inflected form that occurred in this
list but not in the dictionary, the program retrieved the pronunciation of the root from the dictionary, and generated the inflected form
by means of rule application.
5.
The dictionary contains no explicit information regarding the alignment between the letters
in the words and the phones in the base
forms.
The alignment was performed automatically, by means of the trivial
spelling-to-baseform channel model described below.
See "Base form
Alignment" on page 97 ff. for details.
B.2
THE TRIVIAL SPELLING-TO-BASEFORM CHANNEL MODEL
Note: This model was originally designed and implemented by R. Mercer
A trivial
spelling to baseform channel model was needed to calculate match
scores between letter
strings and phone strings, and to align them.
In
this model,
a letter
is modeled as a Markov Source (or finite-state
machine) with an initial and a final state, in which each arc or transi-
Dictionary Pre-processing
96
tion corresponds to a phone. A letter string may be modeled by connecting
together the Markov Sources corresponding to the individual letters in the
string. The transition probabilities of each Markov Source are independent of the context in which it occurs.
The Markov Source model was chosen
mostly because algorithms and programs are available for constructing and
utilizing such a model.
Despite the obvious shortcomings of this model,
it does appear to be sufficiently powerful for the purposes for which it
is used (alignment and base form completion).
The transition probabilities of the Markov Source model (the only parameters in the model) were computed with the Forward-Backward algorithm
[Jeli76] [Baum72], using the dictionary as sample data.
B.3
BASE FORM MATCH SCORE DETERMINATION
The Markov Source model of base forms, as described above, is capable of
computing the probability that any phone string X is the base form of a
given letter string L.
This probability can be used as a match score
between a letter string and a base form.
The most serious shortcoming of the model is in its lack of context sensitivity. While many letters are often silent (some with a probability as
high as 50%), the model is not aware that 'runs'
of silent letters are
much less likely. As a result, the model over-estimates the probability
of short phone strings.
In an attempt to compensate for this shortcoming
in the model, we have used several different functions as estimates of the
match value between a letter string L (of length n) and a base form X:
* if we assume that the model is correct, we can define the match score between L and X as p(X | L);

* if we view p(X | L) as the product of the individual phone probabilities, we can define the match score as the (geometric) average probability per predicted phone, p(X | L)^(1/n);

* finally, we can normalize the probability p(X | L) by dividing by the expected probability of a correct phone string of the same length, and define the match score as p(X | L) / c^n for a suitable constant c.
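Written out, the three candidates are simply three functions of p(X | L); in the sketch below, p_x_given_l, the letter-string length n and the constant c are assumed inputs, and the value 0.5 for c is purely illustrative.

def match_scores(p_x_given_l, n, c=0.5):
    # Three alternative match scores between a letter string of length n and a
    # base form X, given the model probability p(X | L).
    raw = p_x_given_l                       # 1: the probability itself
    per_phone = p_x_given_l ** (1.0 / n)    # 2: geometric average per predicted phone
    normalized = p_x_given_l / (c ** n)     # 3: normalized by the expected
                                            #    probability c**n of a correct string
    return raw, per_phone, normalized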
Since the match calculations were used only for dictionary processing prior to manual editing, no attempt was made to formally evaluate the effectiveness of each of these measures. Of course, any attempt to evaluate
them would have required the manual analysis of a large amount of data.
B.4
BASE FORM COMPLETION
In the dictionary available to us, many pronunciations are specified
incompletely, with a hyphen indicating that a portion of the preceding
pronunciation is to be carried over to complete the pronunciation (see
Figure 15 on page 95).
By convention, only an integral number of (phonemic) syllables may be carried over from the preceding pronunciation. A number of other, unstated rules apply, such as the rule
that any inserted string (of the form -x-) must always replace at least
one syllable: otherwise it would be difficult to determine where the
string should be inserted. As a result, the number of possible completions is always relatively small. For example, there are only two ways
of completing the incomplete pronunciation for the word 'TOTAQUINA' (refer to Figure 15 on page 95):
/t .aa t k (w) 'ee k w 'ai n uh/
/t .aa t uh k (w) 'ee n uh/
A user of the dictionary will probably have little
difficulty identifying
the second option as the correct one. The base form completion program
attempts to do the same: it calculates the match score between the letter
string and each of the proposed base forms, and simply selects the one
with the highest score.
B.5
BASE FORM ALIGNMENT
The Markov Source model of the spelling-to-baseform mapping can be used to
determine the most likely alignment between the letters in a letter string
and the phones in a phone string, by means of a dynamic programming algorithm known as the Viterbi algorithm (see [Forn73]). This algorithm was
used to align a total of 70,000 (spelling, baseform) pairs.
Although we
never formally verified the 'correctness' of the alignments thus
obtained, we did not find any unacceptable alignments in any of the hundreds of cases we inspected.
In fact, it is not easy to define 'correctness' in this context. In case of doubt, we always accepted the alignment
produced by the Viterbi program, to ensure consistency throughout the
training set.
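A much-simplified sketch of such an alignment by dynamic programming is shown below; here each letter is allowed to emit zero, one or two phones, and p_emit(letter, phones) is an assumed probability table standing in for the trained Markov Source model.

import math

def viterbi_align(letters, phones, p_emit):
    # best[i][j] is the best log-probability of aligning the first i letters
    # with the first j phones; back[i][j] records how many phones the i-th
    # letter emitted on the best path.
    L, P = len(letters), len(phones)
    NEG = float('-inf')
    best = [[NEG] * (P + 1) for _ in range(L + 1)]
    back = [[0] * (P + 1) for _ in range(L + 1)]
    best[0][0] = 0.0
    for i in range(1, L + 1):
        for j in range(P + 1):
            for k in (0, 1, 2):            # number of phones emitted by letter i
                if j - k < 0 or best[i - 1][j - k] == NEG:
                    continue
                p = p_emit(letters[i - 1], tuple(phones[j - k:j]))
                if p <= 0.0:
                    continue
                score = best[i - 1][j - k] + math.log(p)
                if score > best[i][j]:
                    best[i][j], back[i][j] = score, k
    # Trace back the winning alignment as (letter, phones) pairs.
    align, j = [], P
    for i in range(L, 0, -1):
        k = back[i][j]
        align.append((letters[i - 1], tuple(phones[j - k:j])))
        j -= k
    return list(reversed(align))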
C.0
THE CLUSTERING ALGORITHM
Note: The algorithms presented in this section represent work by R. Mercer, M. A. Picheny and the author.

C.1
OBJECTIVE
The clustering program described here is quite general in scope, and has
been used for a variety of purposes besides feature selection.
The program requires the following inputs:
1.
An array describing the joint probability distribution of two random
variables. The rows correspond to the entities that are to be clustered; the columns correspond to the possible values of the unknown
that is to be predicted.
2.
An optional specification of a partition of the rows into some number
of 'auxiliary clusters'. These auxiliary clusters represent the combined answers to any previous questions about the rows.
The algorithm attempts to find a binary partition of the rows so that the
conditional mutual information between the two resulting clusters and the
columns, given the information provided by the 'auxiliary clusters', is
maximized. If several possible partitions offer the same value of this
primary objective function, ties are resolved in favor of a secondary
objective function, namely the unconditional mutual information between
the two resulting clusters and the columns.
We can define this formally
as follows: The joint probability distribution is of the form
p(L, x),
where L denotes the 'current letter' of a context and x its pronunciation.
The objective of the program is to define a binary partition,

C: L → {1, 2}

that assigns every letter either to cluster no. 1 or to cluster no. 2. The 'auxiliary clusters' provided as input define a mapping

A: L → {1, 2, ..., n}

that assigns every letter to one of n classes. If no partition is provided as input, n=1. Our primary objective is to

maximize MI( C(L); x | A(L) );

when there are several solutions, our secondary objective is to

maximize MI( C(L); x ).
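Both objective functions can be computed directly from the joint distribution; a small sketch (the function name and the dictionary-based representation of p(L, x), C and A are my own, not the program's):

import math
from collections import defaultdict

def conditional_mutual_information(joint, C, A):
    # MI( C(L); x | A(L) ) computed from the joint distribution p(L, x).
    # joint: dict letter -> dict pronunciation -> probability
    # C, A:  dicts mapping each letter to its cluster; with a single auxiliary
    #        cluster (n = 1) this reduces to the unconditional MI( C(L); x ).
    p_acx = defaultdict(float)   # p(A(L)=a, C(L)=c, x)
    p_ac = defaultdict(float)    # p(A(L)=a, C(L)=c)
    p_ax = defaultdict(float)    # p(A(L)=a, x)
    p_a = defaultdict(float)     # p(A(L)=a)
    for letter, row in joint.items():
        a, c = A[letter], C[letter]
        for x, p in row.items():
            p_acx[(a, c, x)] += p
            p_ac[(a, c)] += p
            p_ax[(a, x)] += p
            p_a[a] += p
    mi = 0.0
    for (a, c, x), p in p_acx.items():
        if p > 0.0:
            mi += p * math.log2(p * p_a[a] / (p_ac[(a, c)] * p_ax[(a, x)]))
    return mi   # in bits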
C.2
IMPLEMENTATION
Merging clusters
The program starts by placing each letter L in a cluster all its own.
(Since this gives us more than two clusters, we need to re-define C as
C: L → {1, 2, ..., m}
where m initially is equal to the number of letters.)
The program then
determines by how much the value of the objective function would drop if
it combined any two clusters into one, and it combines those two clusters
for which the resulting drop is the smallest. This process is repeated
until only some small number of clusters, typically between two and fifteen, remain.
Finding the best binary partition of a small set of clusters
For a small number of clusters n, it is possible to evaluate the objective
function for every possible way of combining these n clusters into two,
and thereby to identify the optimal combination.
If we arbitrarily identify the two final clusters with the values '0' and '1', every possible
combination can be specified as a vector of n bits, in which each bit
indicates whether the corresponding cluster becomes part of cluster '0' or
cluster '1'.
If we interchange the roles of '0' and '1', the partition
defined by such a bit vector remains the same. Therefore, 2^(n-1) different combinations are possible. If we enumerate these combinations using a
Gray code, only one cluster changes place between every pair of successive
combinations, and the value of the objective functions can easily be computed in an incremental fashion.
This method is guaranteed to find the best way of reducing the number of
clusters from n to two.
It often yields a better result than would be
obtained by continuing the repeated merging procedure of the previous section until n=2.
Swapping and Moving elements
In an attempt to further improve the binary partition obtained by one of
the above methods, the clustering program considers all possible ways of
moving a single letter from one cluster to another, as well as all ways of
swapping two letters that are in opposite clusters. The swap or move that
improves the value of the objective function by the greatest amount is
performed.
This process of swapping and/or moving continues until no single swap or move can further improve the value of our objective function.
While the cost of this post-processing stage is substantial, the method
often yields significant improvements. In fact, in all test cases to
which the algorithm has been applied, it has made the enumeration process
described in the preceding section superfluous: the sequence 'merge until
n=2, swap/move until done' always produced the same result as the sequence
'merge until n=14, enumerate and find best combination for n=2, swap/move
until done'.
For some of the larger clustering tasks, the swap/move stage was omitted
because it would take too long.
D.0
LIST OF FEATURES USED
This section contains a listing of the features that were obtained by the
feature selection program using various optimality criteria.
D.1
QUESTIONS ABOUT THE CURRENT LETTER

The following questions about the current letter were obtained by maximizing

MI(QLi(L) ; X(L) | QL1(L), QL2(L), ..., QLi-1(L)),

while resolving ties by maximizing

MI(QLi(L) ; X(L)).
QL1 -
Mutual Information: 0.865437 bits.
True: BCDFKLMNPQRSTVXZ
False: #AEGHIJOUWY'
QL2 -
Conditional MI: 0.721985 bits, unconditional MI: 0.714513 bits.
True: ACDEIKMOQSTUXYZ'
False: #BFGHJLNPRVW
Classes defined by QL1 and QL2: #GHJW, AEIOUY', BFLNPRV, CDKMQSTXZ.

QL3 -
Conditional MI: 0.629876 bits, unconditional MI: 0.651345 bits.
True: #BCEFIKNPQSVXYZ'
False: ADGHJLMORTUW
Classes defined by QL1, QL2 and QL3: #, AOU, BFNPV, CKQSXZ, DMT, EIY', GHJW, LR.
QL4 -
Conditional MI: 0.534354 bits, unconditional MI: 0.595232 bits.
True: #ABCDEFGJKMPQRVX'
False: HILNOSTUWYZ
Classes defined by QL1 through QL4: #, A, BFPV, CKQX, DM, E', GJ, HW, IY, L, N, OU, R, SZ, T.

QL5 -
Conditional MI: 0.190333 bits, unconditional MI: 0.611793 bits.
True: ACDEFGHIKLOPQST
False: #BJMNRUVWXYZ'
Classes defined by QL1 through QL5: BV, CKQ, FP and individual letters.

QL6 -
Conditional MI: 0.050348 bits, unconditional MI: 0.619650 bits.
True: CDFILMSVYZ
False: #ABEGHJKNOPQRTUWX'
Classes defined by QL1 through QL6: KQ and individual letters.
QL7
Conditional MI: 0.000844 bits, unconditional MI: 0.670228 bits.
True: BCDFQRSVZ
False: #AEGHIJKLMNOPTUWXY'
QL1 through QL7 identify each letter uniquely.
D.2
QUESTIONS ABOUT THE LETTER AT OFFSET -1
The following questions about the letter to the left of the current letter were obtained by maximizing

MI(QLLi(LL) ; X(L) | L, QLL1(LL), QLL2(LL), ..., QLLi-1(LL)),

(where L is the current letter, X(L) is the pronunciation of the current letter, and LL is the letter to the left of the current letter), while resolving ties by maximizing

MI(QLLi(LL) ; X(L) | L).

Although this quantity is conditional on L, I will refer to it as 'unconditional' to distinguish it from the former quantity.

QLL1 -
Mutual Information: 0.108138 bits
true: #AEIOQUWY
False: BCDFGHJKLMNPRSTVXZ'
QLL2 -
Conditional MI: 0.076817, unconditional MI: 0.053857
true: #AHILNRSUVX'
False: BCDEFGJKMOPQTWYZ
Classes defined by QLL1 and QLL2: #AIU, BCDFGJKMPTZ, EOQWY, HLNRSVX'.

QLL3 -
Conditional MI: 0.069841, unconditional MI: 0.052867
true: #ABCFHJMNOPSWX
False: DEGIKLQRTUVYZ'
Classes defined by QLL1, QLL2 and QLL3: #A, BCFJMP, DGKTZ, EQY, HNSX, IU, LRV', OW.

QLL4 -
Conditional MI: 0.067447, unconditional MI: 0.052367;
true: #BDFJQRSUWXYZ'
False: ACEGHIKLMNOPTV
Classes defined by QLL1 through QLL4: #, A, BFJ, CMP, DZ, E, GKT, HN, I, LV, O, QY, R', SX, U, W.

QLL5 -
Conditional MI: 0.018370, unconditional MI: 0.052650;
true: #BDEGHJMVWXY'
False: ACFIKLNOPQRSTUZ
Classes defined by QLL1 through QLL5: BJ, CP, KT and individual letters.

QLL6 -
Conditional MI: 0.005148, unconditional MI: 0.055200;
true: #BFHPT
False: ACDEGIJKLMNOQRSUVWXYZ'
QLL1 through QLL6 uniquely identify each letter.
D.3
QUESTIONS ABOUT THE LETTER AT OFFSET +1
The following questions about the letter to the right of the current letter were obtained by maximizing

MI(QLRi(LR) ; X(L) | L, QLR1(LR), QLR2(LR), ..., QLRi-1(LR)),

(where L is the current letter, X(L) is the pronunciation of the current letter, and LR is the letter to the right of the current letter), while resolving ties by maximizing

MI(QLRi(LR) ; X(L) | L).

Although this quantity is conditional on L, I will refer to it as 'unconditional' to distinguish it from the former quantity.

QLR1 -
Mutual Information: 0.142426 bits
True: #AEIORUWY'
False: BCDFGHJKLMNPQSTVXZ
QLR2 -
Conditional MI: 0.089281, unconditional MI: 0.068588;
True: #ABCEFJLMPQSTUVW
False: DGHIKNORXYZ'
Classes defined by QLR1 and QLR2:
# A E U W
B C F J L M P Q S T V
D G H K N X Z
I R Y
QLR3
-
Conditional MI: 0.092730, unconditional MI: 0.061178;
True: #FIJLNTWZ
False: ABCDEGHKMOPQRSUVXY'
Classes defined by QLR1, QLR2 and QLR3: #W, AEU, BCMPQSV, DGHKX, FJLT, I, NZ, ORY'.
QLR4
-
Conditional MI: 0.073014, unconditional MI: 0.081467;
True: #BDEFHILMVY
False: ACGJKNOPQRSTUWXZ'
Classes defined by QLR1 through QLR4: #, AU, BMV, CPQS, DH, E, FL, GKX, I, JT, NZ, OR', W, Y.

QLR5 -
Conditional MI: 0.041879, unconditional MI: 0.085400;
True: #BDFJKQRSTUWZ'
False: ACEGHILMNOPVXY
Classes defined by QLR1 through QLR5: CP, GX, JT, MV, QS, R' and individual letters.

QLR6 -
Conditional MI: 0.005090, unconditional MI: 0.051465;
True: #ABCEFHIJLNOPQRUVXY
False: DGKMSTWZ'
Classes defined by QLR1 through QLR6: CP and individual letters.

QLR7 -
Conditional MI: 0.001989, unconditional MI: 0.088821;
True: #ABDEFGIJKLOPQRSTUWXYZ'
False: CHMNV
QLR1 through QLR7 define all letters uniquely.
D.4
QUESTIONS ABOUT PHONES
The following questions about phones were obtained by maximizing

MI(QPi(P) ; X(L) | QP1(P), QP2(P), ..., QPi-1(P)),

(where X(L) is the pronunciation of the current letter, and P is the last phone before the pronunciation of the current letter), while resolving ties by maximizing

MI(QPi(P) ; X(L)).

QP1 -
Conditional MI: 0.265914, unconditional MI: 0.265914
True: UHO UH1 ER1 AE1 EI1 AA1 AX UXG EHO EH1 EE1 IXO IX
IXG OU1 AW1 UUl UX
False: ERO BX CH DX EEO FX GX HX JH KX LX MX NX PX RX SX SH
TX TH DH UUO VX WX JX NG ZX ZH #
QP2
-
Conditional MI: 0.164915,
unconditional MI: 0.128966
True: ERO AX EEG IXO TH UUO NG ZX #
False: UHO UHi ER1 AE1 EIl AA1 UXG BX CH DX EHO EH1 EE1 FX GX
HX I IXG JH KX LX MX NX OUI AW1 PX RX SX SH TX DH UU1 UX
VX WX JX ZH
Classes defined by QP1 and QP2:

* UHO UH1 ER1 AE1 EIl AAO AA1 UXG EHO EH1 EE1 IX IXG OU1 AW1 UU1 UXi
* ERO EEO TH UUO NG ZX #
* AX IXO
* BX CH DX FX GX HX JH KX LX MX NX PX RX SX SH TX DH VX WX JX ZH

QP3 -
Conditional MI: 0.116981, unconditional MI: 0.076418
True: ERO ER1 EI1 AX EEO EE1 IXG LX MX NX OU1 RX SX TH UUO
UU1 NG ZX
False: UHO UH1 AEl AA1 UXG BX CH DX EHO EHM FX GX HX IXO IX
JH KX AW1 PX SH TX DH UX VX WX JX ZH #
Classes defined by QP1 through QP3:

* UHO UH1 AEl AA1 UXG EHO EHi IX AW1 UXi
* ER1 EIl AAO EEl IXG OU1 UUi
* ERO EEO TH UUG NG ZX
* AX
* BX CH DX FX GX HX JH KX PX SH TX DH VX WX JX ZH
* IX0
* LX MX NX RX SX
* #

QP4 -
Conditional MI: 0.095049, unconditional MI: 0.062220
True: ER1 CH DX EEO EEl IXG JH NX OU1 AW1 SX SH TX TH DH UU1
UX VX JX NG ZX ZH
False: UHO UH1 ERO AE1 EIl AA1 AX. UXG BX EHO EH1 FX GX HX
IXO IXi KX LX MX PX RX UUO WX #
Classes defined by QP1 through QP4: UHO UH1 AEl AA1 UXG EHO EHM
IXi, EIl AAO, ERO UUO, ER1 EE1 IXG OU1 UUi, AX1, BX FX GX HX KX
PX WX, CH DX JH SH TX DH VX JX ZH, EEG TH NG ZX, IXO, LX MX RX,
NX SX, AW1 UX1, #.
QP5 -
Conditional MI: 0.077567,
unconditional MI: 0.037309
True: UHO AX UXG DX GX IXG KX MX OU1 PX SX TX UUO UX NG
false: UH1 ERO ER1 AEl EIl AAl BX CH EHO EH EEO EEl FX HX
IXO IX JH LX NX AW1 RX SH TH DH UUl VX WX JX ZX ZH #
QP6 -
Conditional MI: 0.042339,
unconditional MI: 0.054139
True: AE1 EIl UXG BX DX EHO EH EEl FX IX IXG JH KX LX SX TH
DH UUO UU1 VX NG ZX #
False: UHO UH1 ERO ER1 AA1 AX CH EEO GX HX IXO MX NX OU1 AW1
PX RX SH TX UX1 WX JX ZH
QP7 -
Conditional MI: 0.017873,
unconditional MI: 0.046538
True: ERO AE1 AA1 AXl BX CH DX EEO EE1 GX HX IXO IX NX AW1
TX TH DH UX1 VX JX ZH #
False: UHO UH1 ER1 EIl UXG EHO EH1 FX IXG JH KX LX MX OU1 PX
RX SX SH UUO UUl WX NG ZX
Classes defined by QP1 through QP7: AE1 IX, CH JX ZH, EHO EH1, DH VX, and individual phones.

QP8 -
Conditional MI: 0.003825, unconditional MI: 0.039209
True: UHO ERO ER1 EIl UXG CH EHM EEO EE1 HX IX IXG LX MX OU1
RX SX SH TH DH UUO UU1 WX NG ZX ZH
False: UH1 AE1 AA1 AX BX DX EHO FX GX IXO JH KX NX AW1 PX TX
UXl VX JX #
Classes defined by QP1 through QP8: CH ZH, and individual phones.
E.0
OVERVIEW OF THE DECISION TREE DESIGN PROGRAM
The basic control structure of the decision tree design program is illustrated below.
The details of the mutual information computation and the
termination tests are omitted.
As illustrated, the program retains the
sample data that corresponds to terminated (unextended) nodes.
design_decision_tree: procedure;
  do for level = 1, 2, 3, ...
    call process_nodes_at_current_level();
    if there are no extendable nodes at the next level,
      then stop, else continue;
  end;
end;

process_nodes_at_current_level: procedure;
  create an empty output file;
  open the input file;
  do for all nodes at this level, from left to right;
    call process_node();
  end;
  call copy_data();
  replace the input file by the (now full) output file;
end /* of level */

process_node: procedure;
  call copy_data();
  read the data for the current node;
  determine decision function for the node,
    or decide not to extend it;
  if the node is extended then
    read data for node, copying data for left subtree to output file;
    read data for node, copying data for right subtree to output file;
  else
    read data for node, copying it to output file;
end /* of node */

copy_data: procedure;
  read data from input file, copying it to output file,
    either until a sample for the current node is found,
    or until the end of the input file is reached.
end;
F.0
EVALUATION RESULTS
The following is a listing of the words whose base forms were predicted
incorrectly in the 'final experiment' by either of the two scoring procedures used (see "Performance of the complete system" on page 83).
Some
words occurred several times in the sample data; the number of times that
each word occurred is given following the spelling of each entry. For
words that occurred more than once, repeated errors were counted separately. For example, the word 'SINGLE' occurred five times, and an erroneous
base form was produced four out of five times.
The base forms are given in the notation employed by [Cohe75]; a single digit at the end of each vowel phone represents the stress on that vowel, as follows:

0  unstressed
1  stressed.
The number of errors found by each of the scoring procedures is indicated to the right of each entry:

L (for Low) indicates the number of definite errors;
H (for High) indicates the number of possible errors.
single
single
single
single
generated
professionals
reduced
remove
remove
sometime
Hugh
Hugh
anxious
apparently
appear
appeared
arranged
associated
capabilities
caused
checked
content
copier
criteria
distribute
distributor
SX IX NG UHO LX
SX IX NG UHO LX
SX IX NX UHO LX
SX IX NG UHO LX
JH EH1 NX RX EIl TX IXO DX
PX RX UHO FX EH1 SH UHO LX ZX
RX IXC DX UU1 SX TX
RX EEO MX OU1 VX
RX IXO MX UU1 FX
SX UH1 MX TX AX IXG MX
HX JX UUl GX
HX JX UU1 GX
EIl NG KX SH UHO SX
UHO PX UHO RX EH1 NX TX LX EEO
UHO PX JX ERO
UHO PX JX ERO UHO DX
UHO RX RX EI1 NX JH UHO DX
UHO SX OUl SX EEC EIl TX IXO DX
KX EIl PX UHO BX UHO LX IXO TX EEC ZX
KX AW1 UH1 ZX TX
CH EH1 KX UHO DX
KX AA1 NX TX EH1 NX TX
KX AA1 PX JX ERO
KX RX AX IXG TX JX RX EEC UHO
DX IXO SX TX RX UHO BX JX UU1 TX
DX IXO SX TX RX UHO BX JX UHO TX ERO
L H
1 1
1 1
2 2
1 1
1
2 2
1
1 1
1 2
1
1 1
1 1
1
2 2
1 1
2 2
1 1
1
1
1 1
1 1
1
1 1
1 1
2 2
1 1
earliest
elements
examination
insured
join
lower
magazine
participated
pilot
practices
raised
ready
reflect
relating
represented
resolved
resume
say
significantly
supplemental
ER1
EEO EHl SX TX
EHl
MX UHO NX TX SX
IXc GX
LXI ZX AEl MX UHO NX EIl SH UHO NX
LXI SH UX RX DX
IXl NX
JH
IXG NX
LX
RX
MX UHO GX EIl ZX EEl NX
PX AA RX TX IX SX UHO PX EIl TX IXO DX
PX AX IXG UHO LX UHO TX
PX RX AE1 KX TX SX NX ZX
RX EIl SX TX
RX EIl DX EEO
RX IXC FX LX EH KX TX
RX IXC LX EIl TX IXO NG
RX EHI PX RX IXO ZX EHO NX TX IXO DX
RX IXC ZX AW1 LX PX EEO DX
RX IX0 SX UU1 MX
SX EHI EEO
SX IX1 GX NX IX FX IXO KX UHO NX TX LX "E
SX UH1 PX UHO LX MX EH1 NX TX UHO LX
total
Calculated Error Rates:
1 1
121
1
1
1
2 2
2 3
1
1
2 2
2 2
1 1
1
1
1 2
2 3
1 2
2
1
2 3
41 64
Low Estimate:  41 / 1914 = 2.14%
High Estimate: 64 / 1914 = 3.34%