Using Graphone Models in Automatic Speech Recognition

by

Stanley Xinlei Wang

S.B., Massachusetts Institute of Technology (2008)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2009

© Massachusetts Institute of Technology 2009. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 22, 2009

Certified by: James R. Glass, Principal Research Scientist, Thesis Supervisor

Certified by: I. Lee Hetherington, Research Scientist, Thesis Co-Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses
Using Graphone Models in Automatic Speech Recognition
by
Stanley Xinlei Wang
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 2009, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
This research explores applications of joint letter-phoneme subwords, known as graphones, in several domains to enable detection and recognition of previously unknown
words. For these experiments, graphone models are integrated into the SUMMIT
speech recognition framework. First, graphones are applied to automatically generate
pronunciations of restaurant names for a speech recognizer. Word recognition evaluations show that graphones are effective for generating pronunciations for these words.
Next, a graphone hybrid recognizer is built and tested for searching song lyrics by
voice, as well as transcribing spoken lectures in an open vocabulary scenario. These
experiments demonstrate significant improvement over traditional word-only speech
recognizers. Modifications to the flat hybrid model such as reducing the graphone set
size are also considered. Finally, a hierarchical hybrid model is built and compared
with the flat hybrid model on the lecture transcription task.
Thesis Supervisor: James R. Glass
Title: Principal Research Scientist
Thesis Co-Supervisor: I. Lee Hetherington
Title: Research Scientist
Acknowledgments
I am grateful to my advisor Jim Glass for his guidance and support. His patience
and encouragement have helped me significantly in my past two years at MIT. I would
also like to thank my co-advisor, Lee Hetherington, for his guidance and for helping
me with building speech recognition components using the MIT FST Toolkit. I am
especially appreciative of being able to come by Jim and Lee's offices almost anytime
when I have questions, and get them answered very quickly.
Many thanks to Ghinwa Choueiter for teaching me when I first joined the group,
and for helping me with the word recognition and lyrics experiments. I would like to
thank Paul Hsu for discussions on NLP topics and effective algorithm implementations, Hung-An Chang for help with the lecture transcription experiment, Stephanie
Seneff for discussions on subword modeling, and Scott Cyphers for assistance on getting my scripts running. Furthermore, I would like to thank Mark Adler and Jamey
Hicks at Nokia Research Cambridge for giving me the opportunity to perform speech
research during a summer internship, as well as Imre Kiss, Joseph Polifroni, Jenine
Turner, and Tao Wu for their help during my internship.
I am thankful to my friends at MIT for keeping me motivated. Finally, I would
like to thank my parents for their continued encouragement and support.
This research was sponsored by Nokia as part of a joint MIT-Nokia research
collaboration.
Contents

1 Introduction
  1.1 Motivation
  1.2 Summary of Results
  1.3 Outline

2 Background
  2.1 Working with OOVs
      2.1.1 Vocabulary Selection and Augmentation
      2.1.2 Confidence Scoring
      2.1.3 Filler Models
      2.1.4 Multi-stage Combined Strategies
  2.2 Graphone Models
      2.2.1 Graphone Letter-to-Sound Conversion
      2.2.2 Graphone Flat Hybrid Recognition
  2.3 SUMMIT Speech Recognition Framework

3 Restaurant and Street Name Recognition
  3.1 Restaurant Data Set
  3.2 Methodology
  3.3 Results and Discussion
      3.3.1 Graphone Parameter Selection
      3.3.2 Adding Alternative Pronunciations
      3.3.3 Generalizability of Graphone Models
  3.4 Summary

4 Lyrics Hybrid Recognition
  4.1 Lyrics Data Set
  4.2 Methodology
  4.3 Results and Discussion
  4.4 Controlling the Number of Graphones
  4.5 Summary

5 Spoken Lecture Transcription
  5.1 Lecture Data Set
  5.2 Methodology
  5.3 Results and Discussion
  5.4 Controlling the Number of Graphones
  5.5 Hybrid Recognition with Full Vocabulary
  5.6 Hierarchical Alternative for Hybrid Model
  5.7 Summary

6 Summary and Future Directions
  6.1 Future Directions

A Data Tables: Restaurant and Street Name Recognition

B Data Tables: Lyrics Hybrid Recognition

C Data Tables: Spoken Lecture Transcription
List of Figures

1-1  Vocabulary sizes of various corpora as more training data is added (from [25]).

1-2  New-word rate as the vocabulary size is increased for various corpora (from [25]).

2-1  Generative probabilistic process for graphones as described by Deligne et al. According to an independent and identically-distributed (iid) distribution, a memoryless source emits units $g_k$ that correspond to $A_k$ and $B_k$, each containing a symbol sequence of variable length.

2-2  Directed acyclic graph representing possible graphone segmentations of the word-pronunciation pair (far, f aa r). Black solid arrows represent graphones of up to one letter/phoneme. Green dashed arrows (only a subset shown) represent graphones with two letters or two phonemes. Traversing the graph from the upper-left corner to the lower-right corner generates a possible graphone segmentation of this word.

3-1  Word recognition performance using automatically generated pronunciations produced by a graphone L2S converter trained with various L and N values.

3-2  Word recognition performance using the top 2 pronunciations generated using various L and N values.

3-3  Word recognition performance with various numbers of pronunciations generated using the graphone (L = 1, N = 6) and spellneme L2S converters.

3-4  Word recognition performance for seen words when the top 1 pronunciation is used.

3-5  Word recognition performance for unseen words when the top 1 pronunciation is used.

3-6  Word recognition performance for seen words when the top 2 pronunciations are used.

3-7  Word recognition performance for unseen words when the top 2 pronunciations are used.

4-1  Recognition word error rate (WER) for recognizers with 50-99% training vocabulary coverage. Word & OOV replaces each output graphone sequence by an OOV tag; Word & Graphone concatenates consecutive graphones in the output to spell OOV words.

4-2  Recognition sentence error rate (SER) for recognizers with 50-99% training vocabulary coverage. Word & OOV replaces each output graphone sequence by an OOV tag; Word & Graphone concatenates consecutive graphones in the output to spell OOV words.

4-3  Recognition letter error rate (LER) for recognizers with 50-99% training vocabulary coverage. Word & Graphone concatenates consecutive graphones in the output to spell OOV words.

5-1  Recognition word error rate (WER) for 88-99% training vocabulary coverages. Word & OOV replaces each output graphone sequence by an OOV tag; Word & Graphone concatenates consecutive output graphones to spell OOV words.

5-2  Recognition sentence error rate (SER) for 88-99% training vocabulary coverages. Word & OOV replaces each output graphone sequence by an OOV tag; Word & Graphone concatenates consecutive output graphones to spell OOV words.

5-3  Recognition letter error rate (LER) for 88-99% training vocabulary coverages. Word & Graphone concatenates output graphone sequences to spell OOV words.

5-4  Test set perplexity of language models trained on a replicated training corpus.

5-5  Hierarchical OOV model framework (from [2]). $W_{oov}$ represents the OOV token.
List of Tables

2.1  Examples of graphone segmentations of English words. These maximum-likelihood segmentations are provided by a 5-gram model that allows a maximum of 4 letters/phonemes per graphone.

3.1  Examples of the L2S converter's top 3 hypotheses using the graphone language model.

4.1  Vocabulary size of hybrid recognizers, and their vocabulary coverage levels on the training and test sets.

4.2  Example recognizer output at the 92% vocabulary coverage level. Square brackets denote concatenated graphones (only letter part shown). Vertical bars are graphone boundaries.

4.3  Number of graphones for the L = 4 unigram model after trimming using the specified Kneser-Ney discount value.

4.4  Comparison of recognition performance for the full graphone hybrid model and trimmed graphone models (4k and 2k graphone set sizes). We also show results for the word-only recognizer. All of these recognizers have 80% vocabulary coverage.

5.1  Vocabulary sizes for the hybrid recognizer, and corresponding vocabulary coverage levels on the training and test sets.

5.2  Example recognizer output for the hybrid and word-only recognizers at 96% training corpus coverage, along with reference transcriptions. Square brackets indicate concatenation of graphones (only letter part shown). Vertical bars denote graphone boundaries.

5.3  Examples of recognition output for the 100% coverage word-only recognizer and a 98% coverage hybrid recognizer, along with the corresponding reference transcriptions. Square brackets denote concatenated graphones. Vertical bars show graphone boundaries. Note that this table differs from Table 5.2 in that the hybrid and word-only recognizers have different coverage levels.

5.4  Comparison of recognition performance for the full graphone hybrid model and trimmed graphone models (4k and 2k graphone set sizes). Results are also shown for the word-only recognizer. All of these recognizers have 96% vocabulary coverage on the training corpus.

5.5  Performance of hybrid recognizers with full vocabulary compared to the full-coverage baseline word-only recognizers. Word-only 1x and 10x are built with one and ten copies of the training corpus, respectively. The OOV column is for replacing graphone sequences by OOV tags. The Gr column is for concatenating graphones together to form words.

5.6  Comparison of hybrid approaches for the lecture transcription task at 98% vocabulary coverage.

5.7  Examples of recognition output from hierarchical, flat, and word-only models. All recognizers have 98% training corpus coverage. Square brackets denote concatenated graphones (only letter parts shown). Vertical bars show boundaries between graphones.

A.1  Word recognition performance using automatically-generated pronunciations for restaurant and street names. Graphone L2S converters with various L and N values are used to generate pronunciations. Results are shown in WER (%). This data is used to generate Figures 3-1, 3-2, 3-4, 3-5, 3-6, and 3-7.

A.2  Word recognition performance with various numbers of pronunciations. The graphone converter is trained with L = 1 and N = 6. The spellneme data is from [13]. Results are shown in WER (%). This data corresponds to Figure 3-3.

B.1  Performance of the flat hybrid recognizer on music lyrics. $C_{train}$ and $C_{test}$ are vocabulary coverages (%) on the training and test sets, respectively. |V| is the vocabulary size. Word & OOV replaces each graphone sequence in the recognizer's output by an OOV tag; Word & Graphone concatenates graphones to spell OOV words. All error rates are in percentages. This data corresponds to Figures 4-1, 4-2, and 4-3.

C.1  Performance of the flat hybrid recognizer on transcribing spoken lectures. $C_{train}$ and $C_{test}$ are vocabulary coverages (%) on the training and test sets, respectively. |V| is the vocabulary size. Word & OOV replaces each graphone sequence in the recognizer's output by an OOV tag; Word & Graphone concatenates graphones to spell OOV words. All error rates are in percentages. This data corresponds to Figures 5-1, 5-2, and 5-3.
Chapter 1
Introduction
1.1
Motivation
Current speech recognition platforms are limited by their knowledge of words and how
they are pronounced. Almost all speech recognizers have a pronunciation dictionary
that has a finite vocabulary. If the recognizer encounters a word that is not in the
speech recognizer's vocabulary, then the recognizer cannot produce a transcription of
the word. This limitation is detrimental in many speech recognition applications. For
example, if the user wants to find the name of an obscure restaurant while driving
home from work, the user could use a spoken dialogue interface in the car, but the
dialogue interface may have no idea how the restaurant name should be pronounced,
resulting in the inability to understand the user's request and the failure to retrieve
information about the restaurant. Another example is transcribing video to enable
conventional text-based search. Uncommon words in these videos are often not in
the speech recognizer's vocabulary. In addition, because a recognition system needs
to find boundaries between words in the input audio stream, a misrecognized word
can easily cause the surrounding words, or even the entire sentence, to be completely
misrecognized.
This often results in user frustration and reduced user confidence
in speech dialogue systems in general. Therefore, in order to build speech dialogue
systems that are applicable in practical user scenarios, we must develop techniques
to mitigate the effects of out-of-vocabulary (OOV) words.
There are several methods to combat the OOV problem. One possibility is to
dynamically generate pronunciations of OOV words prior to recognition. For a search-by-voice application, pronunciations can be automatically generated for new words
as they are imported into a database. These newly generated pronunciations can
then be added to the speech recognizer's vocabulary such that they are no longer
out-of-vocabulary.
However, we may not be able to perform all large vocabulary
search-by-voice tasks by increasing the vocabulary. If we increase the vocabulary by
large factors, we may be hindered by sparsity in the available language model training
data, which could lead to more confusion and degraded recognition performance.
Another argument concerns the usability of these search-by-voice interfaces. When
a user makes a voice query, if there are truly no matching results, it's better to be
able to tell the user that the query produces no results, rather than automatically
constraining the audio query to search terms that do produce results, but are not
even close to what the user intended. In the latter case, the user may be convinced
that the lack of results is because of misrecognition.
Simply increasing the vocabulary of a recognizer also cannot solve the OOV problem in real-time transcription tasks, such as converting streaming broadcast news to
text. In this case, we have no way of knowing what the speaker will say, so we have
to be able to automatically hypothesize spellings of new words during recognition.
Traditional speech recognizers cannot do this because their recognition frameworks
are constrained by their known vocabulary, and are only able to output words in the
known vocabulary. One may wonder if we could just use a large lexicon as the speech
recognizer's vocabulary, and not worry about recognizing the rare esoteric terms in
speech. However, in most applications of real-time audio transcription, this approach
has a crucial weakness: OOV words are likely to be content words that are highly
important for the topic discussed in the audio. In a broadcast news scenario, since
newly-coined words frequently appear in headlines, a recognizer with finite vocabulary
is likely to miss key words in these news reports.
Vocabulary size growth in training corpora has been characterized in the past
[25]. Figure 1-1 shows the growth of vocabulary size as a function of the number of words in
several training corpora. These measurements are calculated by first randomizing
each corpus and then computing the vocabulary size while scanning the corpus from
beginning to end. This process is repeated several times, and the results are averaged
to yield the final measurements. ATIS, F-ATIS, VOYAGER are data of users talking
with a spoken language system. WSJ, NYT, and BREF are from newspaper text.
CITRON and SWITCHBOARD are from human-to-human speech. In all of these
corpora, we can see that even at very large training corpora sizes, the vocabulary size
continues to increase rapidly without leveling off. This demonstrates that new words
continue to appear even as the training data gets very large. The same study also
examines the rate at which new words appear for the same set of corpora, as shown
in Figure 1-2. This data shows that as we increase the vocabulary coverage, we still
get a substantial number of new words even at very large vocabulary sizes.

[Figure: vocabulary size vs. number of training words, log-log scale]
Figure 1-1: Vocabulary sizes of various corpora as more training data is added (from [25]).

[Figure: new-word rate vs. vocabulary size]
Figure 1-2: New-word rate as the vocabulary size is increased for various corpora (from [25]).
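The measurement just described can be sketched in a few lines of Python (an illustration only: the function name, sampling interval, and number of trials are my own choices, and the exact procedure used in [25] may differ):

    import random

    def vocab_growth_curve(tokens, num_trials=5, step=10000, seed=0):
        """Average vocabulary size vs. number of tokens scanned, over
        several random shufflings of the corpus (as in Figure 1-1)."""
        rng = random.Random(seed)
        sums = [0.0] * (len(tokens) // step)
        for _ in range(num_trials):
            shuffled = list(tokens)
            rng.shuffle(shuffled)
            seen = set()
            for i, tok in enumerate(shuffled, start=1):
                seen.add(tok)
                if i % step == 0:
                    sums[i // step - 1] += len(seen)
        return [s / num_trials for s in sums]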
One method for mitigating the effect of OOVs is to use sub-lexical units to model
OOV words. These units represent pieces of words, rather than whole words. One
of the most successful types of these units is graphones, which are joint
sequences of letters and phonemes [6]. These units are automatically learned from
a pronunciation dictionary, so they have potential for applications in any language
as long as the written form and pronunciations can be represented as two streams of
symbols. These units are useful for both automatically generating the pronunciation
of words in a dictionary as well as constructing a recognizer that can automatically
hypothesize spellings of OOV words. Because of the urgent need for effective OOV
models and the success of graphone models so far, we focus the work in this thesis on exploring applications of graphone models in the existing speech recognition
framework at MIT.
1.2
Summary of Results
This section contains an overview of the experiments presented in this thesis, along
with important results.
* Generating pronunciations for restaurant and street names. A graphone-based
letter-to-sound converter is built and used to automatically generate pronunciations for restaurant and street names.
Word recognition is then used to
evaluate the quality of these pronunciations. These experiments show that using a 6-gram, with graphone sizes of 0-1 letters/phonemes, produces the best
results. The generalizability of these results is demonstrated by dividing
the test set into seen and unseen words. We also explore using multiple L2S
pronunciations for each word, and conclude that using 10 pronunciations per
word produces the best performance without increasing confusion within the
recognizer. Graphone results are also shown to compare favorably with those
of linguistically-motivated subword models.
* Music lyrics hybrid recognition. A flat hybrid recognizer is built using graphones, and tested on transcribing spoken lyrics in a closed-vocabulary scenario.
The hybrid recognizer significantly outperforms the word-only recognizer at all
vocabulary coverage levels. We find that a hybrid recognizer with >92% vocabulary coverage on the training set actually yields lower recognition error rates
than a word-only recognizer with 100% coverage on the training set. The results
also demonstrate that graphones are useful for preventing misrecognition of the
neighbors of OOV words, as well as hypothesizing the spelling of OOV words.
In addition, we experiment with reducing the graphone set significantly, and
show that the hybrid recognition performance degrades slightly, but is still significantly better than the word-only recognizer at the same vocabulary coverage
level.
* Transcribing spoken lectures. Spoken lectures are transcribed using a flat hybrid
recognizer at several vocabulary levels. This task is open vocabulary because we
cannot know all the words the speaker will say during the lecture. The hybrid
recognizer's performance is significantly better than the word-only recognizer
at 88-96% vocabulary coverage on the training set, and slightly better than
the word-only recognizer at 98-99% coverage levels. We show that the hybrid
model's performance remains about the same when the graphone set size is
reduced from 19k to 4k. A hybrid recognizer with 100% vocabulary coverage
is also built and shown to perform about the same as a word-only recognizer
at 100%, even though the graphones can add more confusion to the recognizer.
Finally a hierarchical hybrid model is built and compared with the flat hybrid
model. Results show that at 98% training set coverage, these two hybrid models
perform about the same, and both outperform the word-only recognizer.
1.3
Outline
In the rest of this thesis, we first give an overview of previous work that examines the
OOV problem. We then discuss the background technologies used in this thesis:
graphone models and the SUMMIT speech recognition framework. In subsequent
chapters, we explore applications of graphone models for speech recognition in areas
described in Section 1.2. Finally, we discuss future work that builds upon the research
presented in this thesis.
Chapter 2
Background
Since OOV words are inevitable in large vocabulary speech recognition tasks, we must
find ways to cope with them in order to build practical speech applications. In this
chapter, we first review past techniques for mitigating the effect of out-of-vocabulary
(OOV) words on speech recognition performance. Next, we introduce graphone models for both automatic pronunciation generation and hybrid speech recognition. Finally, we give an overview of the SUMMIT speech recognition framework used in this
thesis.
2.1
Working with OOVs
Several techniques have been developed to mitigate the effect of OOV words on speech
recognition performance. The most straightforward approach is to increase the vocabulary of the recognizer, but this cannot completely eliminate OOVs in open vocabulary tasks such as audio transcription. Other approaches focus on adding probabilistic machinery to the speech recognizer to detect OOVs in utterances, and possibly
even hypothesize pronunciations and spellings associated with the OOV word. These
techniques include confidence scoring, filler models, or a combination of both. Multi-stage recognition can also be used to improve large vocabulary recognition by using
different recognition strategies in each stage.
2.1.1
Vocabulary Selection and Augmentation
One way to combat OOVs in speech recognition is to select the best set of words
to be in-vocabulary for the target domain. If the system has a good idea of
what the user might say, the system can be trained with a set of words that give a
small OOV rate. One example is a system that provides weather information, which
can be trained with a set of geography-related words based on frequency of usage
[43]. Martins et al. combine morphological and syntactic information to optimize
the trade-off between the predicted vocabulary coverage and the number of added
words [30]. Morphological analysis is used to label words with part-of-speech tags,
which are then used to optimize vocabulary selection. Another recent study explores
using the internet as a source for determining which words to include in the speech
recognizer's vocabulary [32]. This technique uses a two-pass strategy for vocabulary
selection, where the first pass produces search engine requests, and the second pass
augments the speech recognizer's lexicon with words from returned online documents.
We can also increase the vocabulary by automatically generating pronunciations
for unknown words. This can be performed by encoding linguistic knowledge, or automatically learning letter-sound associations. The older techniques include purely
rule-based approaches that use context-dependent replacement rules [18].
Subsequent techniques have been more data-driven, such as using finite-state transducers
[38], or decision trees to capture the mapping between graphemes and phonemes
[15]. Another technique is hidden Markov models (HMM), which formulates a generative model for finding the most likely phoneme sequence for generating the observed
grapheme sequence [37]. There are discriminative techniques as well, such as perceptron HMM trained in combination with the margin infused relaxed algorithm (MIRA)
[27]. Finally, there has also been a series of successful techniques based on sub-lexical
models, which are the most relevant to this thesis.
Probabilistic sub-lexical models capture probabilities over chunks of letters commonly referred to as subwords, or sub-lexical units. These units are often correlated
to syllables that commonly occur in a language. Some of these models automatically
learn the sub-lexical units, while others have a predefined set of them. However, they
all use a probabilistic framework to capture the dependencies between the subwords
that make up words in the training set. The advantage of probabilistic approaches is
that these methods can generally handle the problem of aligning the grapheme and
phoneme sequence much better than many non-probabilistic approaches.
One technique is to use context-free grammar (CFG) parse trees with a superimposed N-gram model to parse words into lexical subwords [31, 34, 35]. First, the
pronunciation of words in the lexicon are converted to sequences of subword pronunciations through simple rewrite rules. Next, a rule-based CFG is applied to convert
the words in the lexicon to subwords, using the subword pronunciation sequences
as a constraint. This process converts the lexicon into lexical subword units containing both letter and sound information, which are also called spellnemes. Failing
parses are either manually corrected, or made possible by the addition of new rules.
This correction process is repeated iteratively until the entire lexicon is successfully
converted into subword units. Finally, N-gram probabilities are learned from lexical
subword sequences such that this statistical model can be used to segment unknown
words into lexical subwords.
Another method is maximum-entropy N-gram modeling of sub-lexical units, in
the form of either a conditional model, or a joint model [11]. The conditional model
captures probabilities of phonemes conditioned on previous phonemes and letters, over
a sliding window. Probability distributions are estimated using maximum entropy
with a Gaussian prior. For the joint model, a maximum entropy N-gram model is
calculated on training data of words segmented into chunks consisting of either a
single letter, a single phoneme, or a single letter-phoneme pair.
Joint grapheme-and-phoneme sequences have been used to model the letter-to-sound relationship. These models, known as graphone models, are automatically
learned from a pronunciation dictionary through EM training with language modeling.
They are described in detail in Section 2.2.1.
It is still an open question whether having a huge vocabulary recognizer is better or
worse than an open-vocabulary recognizer that can recognize arbitrary words. One
recent study focused on Italian, and shows that augmenting a speech recognizer's
lexicon with automatically-generated pronunciations of OOV words performs slightly
better than OOV modeling in an open-vocabulary recognizer [20]. However, in many
applications, such as transcribing audio files, there is no way to eliminate all OOV
words because we do not know what the speaker will say. In these cases, we can use
methods to detect and hypothesize spellings of OOV words.
2.1.2
Confidence Scoring
Confidence scoring allows us to estimate how likely it is that a hypothesized word or word
sequence is an OOV word. Several criteria can be used to generate the confidence score
for a region in the recognition hypothesis. They include word posterior probability,
acoustic stability, and hypothesis density [41, 28]. Word posterior probability can
be computed on the output lattice or the output N-best list of a speech recognizer.
Word lattice posterior probabilities are probabilities associated with edges in the
word lattice, and can be computed by the well-known forward-backward algorithm.
Word N-best list posteriors are different from lattice posteriors in that the N-best
list does not have a time associated with each word in the hypothesis. This could
be an advantage because the probabilities for the same speech recognizer output are
essentially collapsed together. Acoustic stability is a measure of how consistently a given
word occurs in the same position across the hypotheses in the N-best list. Hypothesis density
is a measure of how many hypotheses have similar likelihoods at a given point in time.
The higher the hypothesis density, the more likely an error. Comparisons of these
measures show that using word posteriors on word lattices yields the best results [41].
Multiple confidence measures can also be combined to construct confidence vectors,
which are then passed through a linear classifier to determine if the corresponding
word is OOV [24].
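For illustration, a crude posterior estimate can be computed from an N-best list alone (the input format, a list of (log score, word sequence) pairs, is an assumption on my part; lattice-based posteriors as described above are generally preferred):

    import math
    from collections import defaultdict

    def word_posteriors_from_nbest(nbest):
        """Crude word posterior estimates from an N-best list: the posterior
        of a word is the normalized probability mass of the hypotheses that
        contain it. `nbest` is a list of (log_score, list_of_words) pairs."""
        total = sum(math.exp(score) for score, _ in nbest)
        posterior = defaultdict(float)
        for score, words in nbest:
            for word in set(words):  # count each word once per hypothesis
                posterior[word] += math.exp(score) / total
        return dict(posterior)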
More recent studies have explored comparing information from several different
sources to detect OOVs. Lin et al. jointly align word and phone lattices to find OOV
words [29]. Another study compares confidence measures from strongly constrained
and weakly-constrained speech recognizers [10]. The strongly-constrained recognizer
is a regular speech recognizer, while the weakly constrained data comes from a bigram
phoneme recognizer and a neural network phone posterior estimator. Word-to-phone
transductions have also been combined with phone-lattices to find OOVs [42]. In
addition to comparing output of weakly and fully constrained phoneme recognition
outputs, this study uses a transducer to decode the a frame-based recognizer's top
phoneme output into words, and then compares this output with the word output of
a frame-based speech recognizer.
2.1.3
Filler Models
Filler models vary in degrees of sophistication, depending on the application. Some
filler models are based on phoneme sequences, and can hypothesize an OOV's pronunciation. Other models are based on joint letter-sound subwords, which not only can
hypothesize pronunciation, but also can hypothesize spelling of OOV words. Filler
models can be incorporated into a traditional speech recognition network through two
main approaches: hierarchical hybrid and flat hybrid. In the hierarchical approach,
an OOV tag is added to the speech recognizer's vocabulary. At a given point in
decoding an utterance, the recognizer has a choice of decoding the next portion of
the utterance as a word, or the OOV tag. This OOV tag is modeled by a separate
OOV model that has probabilistic information of phones, syllables, or both. In these
recognizers, an extra cost can be added for entering an OOV model because we would
prefer to recognize words if the words are in-vocabulary. In the flat-hybrid approach,
OOV words in the recognizer's training corpus are turned into subwords or phonemes,
which are treated just like regular words in the rest of the recognizer training process.
Bazzi and Glass use phones and syllables to model OOV words through a hierarchical approach [3, 4]. The phoneme-level OOV model is an N-gram model computed
on a training corpus of phoneme sequences. Later work in this area uses automatically-learned variable-length phonetic units for the OOV model [5]. In this study, mutual
information is used as a criterion for merging phonemes into bigger sequences. The
authors show that using variable-length units for the OOV model leads to improved
recognition performance over using only individual phonemes. They also show that if
individual phonemes were used, it's better to train the OOV model on phoneme sequences in a large-vocabulary dictionary rather than on phoneme sequences of words
in a recognizer's LM training corpus. This reduces both domain-dependent bias of the
OOV model, as well as bias toward phonemes of frequent words in the LM training
corpus.
Modeling OOV words with phonetic units can be effective for OOV detection,
but does not allow us to hypothesize the spelling of an OOV word. To do so, our
OOV model must incorporate spelling information into subword units. For example,
spellnemes can do this by encoding both phonetic and spelling information in each
spellneme [35]. In a previous study, Choueiter incorporates spellnemes into a recognizer in a flat-hybrid configuration to model OOV words [12], and demonstrates that
using spellnemes in flat hybrid recognition decreases word error rate and sentence
error rate for music lyrics spoken by a user with the intention to search for a song.
Some OOV words are spelled correctly through spellneme modeling.
Graphones have also been used in flat hybrid recognition [8, 19]. Graphones also
contain both spelling and pronunciation information, and are trained in an unsupervised fashion. The details of graphone flat hybrid recognition are described in Section
2.2.2.
2.1.4
Multi-stage Combined Strategies
Efforts have been made to combine confidence scoring and filler model approaches in a
multi-stage framework. In the first stage, an utterance is recognized into subword
or phoneme lattices. In the second stage, these lattices are used to generate confidence
scores for words or time frames. One technique is to use a phoneme OOV model in the
first stage to detect OOV words, and then use confidence scores on remaining words
to detect words with confidence [23].
Since filler models can introduce confusion
with in-vocabulary words, it would be helpful to have good methods to prevent invocabulary words from being detected as OOV words. As an alternative to having a
cost for diving into the OOV model, confusion network and vector space model post
processing can be used to detect OOVs while accounting for confusion between in-
vocabulary terms and OOV words [33]. Multi-stage frameworks can also have more
than two stages. Chung et al. present a three-stage solution, where the first stage
generates phonetic networks, the second stage produces a word network, and the third
stage parses the word network using a natural language processor [14].
2.2
Graphone Models
In this section, we describe the training process for graphones, and their applications in speech recognition. We first discuss letter-to-sound (L2S) conversion because
graphones were originally developed for the L2S task. L2S conversion is also an important part of vocabulary augmentation. We then discuss how graphone models can be
incorporated directly into speech recognizers to construct hybrid speech recognition
systems.
2.2.1
Graphone Letter-to-Sound Conversion
Graphone models capture the relationship between a word and its pronunciation
by representing them as a sequence of graphone units. Each graphone unit has a
grapheme sequence part and a phoneme sequence part. Maximum-likelihood learning
on a pronunciation dictionary is used to automatically learn graphone units as well as
transitions between graphones. In various publications, graphones have been referred
to as joint-multigrams, graphonemes, and grapheme-to-phoneme correspondences.
The theoretical foundation for graphone models is presented by Deligne et al.,
who illustrate a statistical learning framework for aligning two sequences of
symbols [16, 17]. The underlying assumption is that there is a memoryless source that
emits the sequence of units $G = g_1, g_2, g_3, \ldots, g_K$, where each $g_k$ contains two
parallel parts: $A_k$ and $B_k$ (see Figure 2-1). $A_k$ and $B_k$ are each variable-length
symbol sequences themselves.

[Figure: a memoryless source emitting the unit sequence $g_1, g_2, g_3, \ldots, g_K$]
Figure 2-1: Generative probabilistic process for graphones as described by Deligne
et al. According to an independent and identically-distributed (iid) distribution, a
memoryless source emits units $g_k$ that correspond to $A_k$ and $B_k$, each containing a
symbol sequence of variable length.

The symbol sequences within the $A_k$'s can be concatenated to form a sequence $R$.
We can do the same for the sequences within the $B_k$'s to form the sequence $S$. For the
machine learning task, we observe the sequences $R$ and $S$, and we want to estimate
the underlying unit sequence $G$. In the L2S conversion context, $R$ is the sequence
of letters and $S$ is the sequence of phonemes. The term graphone refers to $g_k$ in
this formulation. Table 2.1 shows some examples of graphone sequences for English
words.

Word            Graphone Sequence
alphanumeric    (alph/ae l f) (anum/ax n uw m) (eric/eh r ax kd)
colonel         (colo/k er) (n/n) (el/ax l)
dictionary      (dic/d ih kd) (tion/sh ax n) (ary/eh r iy)
semiconductor   (semi/s eh m iy) (con/k ax n) (duct/d ah kd t) (or/er)
xylophone       (xylo/z ay l ax) (phon/f ow n) (e/)

Table 2.1: Examples of graphone segmentations of English words. These maximum-
likelihood segmentations are provided by a 5-gram model that allows a maximum of
4 letters/phonemes per graphone.

Building an L2S converter is a three-step process. The first step is to develop a
graphone model that best represents the data. The second step is to segment the
training data into graphones. The third step is to build an N-gram language model
on the segmented training dictionary. This language model can then be used to
perform L2S conversion on new words.
For the first step, the graphone model is learned through maximum-likelihood
training. Expectation maximization (EM) is performed on the training dictionary by
running the forward-backward algorithm on directed acyclic graphs (DAGs) representing letter and phoneme sequences. Each DAG corresponds to one word-pronunciation
pair. It can be visualized as a 2-dimensional grid: one dimension represents the
traversal of letters in the word, and the other dimension represents the traversal of
phonemes in the pronunciation (see Figure 2-2).

[Figure: grid DAG for the pair (far, f aa r); letters index one axis and phonemes the other]
Figure 2-2: Directed acyclic graph representing possible graphone segmentations of
the word-pronunciation pair (far, f aa r). Black solid arrows represent graphones
of up to one letter/phoneme. Green dashed arrows (only a subset shown) represent
graphones with two letters or two phonemes. Traversing the graph from the upper-left
corner to the lower-right corner generates a possible graphone segmentation of this word.

Each arc in the DAG corresponds
to a graphone, and is labeled with the graphone's probability. During training, we
must specify the minimum and maximum number of letters and phones allowed for
each graphone, which limits the length of arcs in the DAG. Any path through this
DAG represents a possible graphone segmentation of this word-pronunciation pair.
The forward algorithm runs from the upper-left corner toward the lower right corner,
while the backward algorithm runs the opposite way. These forward and backward
scores are used to calculate the posterior for each arc. Posteriors corresponding to
the same graphone are accumulated over DAGs of all dictionary entries, and then
normalized to generate the updated probability for that graphone. This process is
repeated until the likelihood of the training data stops increasing.
Mathematically, the EM training process can be formulated as follows. The
likelihood of a dictionary, $\Lambda$, is the product of the likelihoods of the $W$ word-pronunciation
pairs in the training dictionary. For a word-pronunciation pair, let the letter sequence
be $r_1, r_2, \ldots, r_M$, and the phoneme sequence be $s_1, s_2, \ldots, s_N$. We also denote a stream
of graphones as $g_1, g_2, \ldots, g_K$. The likelihood of this word-pronunciation pair is the
sum of the likelihoods of all possible segmentations. The likelihood of a possible
segmentation, $z$, is $p(z) = \prod_{k=1}^{K} p(g_k)$. The goal of EM training is to find a
graphone model that maximizes the likelihood of the training dictionary, which is:

$$p(\Lambda) = \prod_{w=1}^{W} \sum_{z \in \{z\}_w} p(z) \qquad (2.1)$$

where $\{z\}_w$ is the set of all segmentations of the $w$-th word in the dictionary.
The forward-backward algorithm is formulated as follows. Let $t_{m,n}$ represent the
node at location $(m, n)$ in the word-pronunciation pair's DAG, where $t_{0,0}$ is the upper-left
node in the DAG, and $t_{M,N}$ is the lower-right node (see Figure 2-2). Let $\alpha(t_{m,n})$
be the forward score and $\beta(t_{m,n})$ be the backward score corresponding to node $t_{m,n}$.
The forward score for $t_{m,n}$ represents the likelihood of the letters and phonemes between
$t_{0,0}$ and $t_{m,n}$. The backward score for $t_{m,n}$ represents the likelihood of the letters
and phonemes between $t_{m,n}$ and $t_{M,N}$. For boundary conditions, we use $\alpha(t_{0,0}) = 1$
and $\beta(t_{M,N}) = 1$. Let $X_{m,n}$ denote an arc that leads into node $t_{m,n}$, and let $\bar{x}_{m,n}$
denote the starting node of this arc. Let $Y_{m,n}$ denote an arc that leads out of node
$t_{m,n}$, and let $\bar{y}_{m,n}$ denote the ending node of this arc. In addition, $\{X_{m,n}\}$ denotes
the set of all arcs that enter node $t_{m,n}$, and $\{Y_{m,n}\}$ is the set of all arcs that leave
this node. As we move through the graph, the forward and backward scores can be
computed using:

$$\alpha(t_{m,n}) = \sum_{\{X_{m,n}\}} \alpha(\bar{x}_{m,n})\, p(X_{m,n}) \qquad (2.2)$$

$$\beta(t_{m,n}) = \sum_{\{Y_{m,n}\}} \beta(\bar{y}_{m,n})\, p(Y_{m,n}) \qquad (2.3)$$
After the forward and backward scores are calculated, we can re-estimate the
posterior probability for each graphone $g_q$ used in the word-pronunciation pair $w$:

$$\mathrm{pos}_w(g_q) = \frac{\sum_{\{X_{m,n} :\, \mathrm{gr}(X_{m,n}) = g_q\}} \alpha(\bar{x}_{m,n})\, p(X_{m,n})\, \beta(t_{m,n})}{\alpha(t_{M,N})} \qquad (2.4)$$

where $\mathrm{gr}(X_{m,n})$ is the graphone corresponding to the arc $X_{m,n}$.
The posterior probabilities for all graphones are accumulated over all word-pronunciation
pairs in the training dictionary, and normalized such that the graphone
probabilities sum to one. In the following equation for updating the graphone
probabilities, $Q$ denotes the graphone set size:

$$p(g_q) = \frac{\sum_{w=1}^{W} \mathrm{pos}_w(g_q)}{\sum_{w=1}^{W} \sum_{q'=1}^{Q} \mathrm{pos}_w(g_{q'})} \qquad (2.5)$$
After updating probabilities for all graphones, we check if the training dictionary's
likelihood has stopped increasing, and repeat another EM iteration if needed. The
training dictionary's likelihood is calculated by taking the product of the $\alpha(t_{M,N})$'s for all
words. The number of graphones can be controlled in a number of ways, the simplest
being trimming with a probability threshold. Also, in a practical implementation,
the probabilities would be stored in the log domain to reduce the effect of machine
precision.
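The following sketch illustrates one EM iteration of Equations 2.2-2.5 for a unigram model (illustrative only: the data layout and names are my own, graphones are capped at max_len letters/phonemes, `probs` is assumed to be initialized, e.g. uniformly, over all candidate graphones, and the log-domain arithmetic and trimming of practical implementations are omitted):

    from collections import defaultdict

    def em_iteration(dictionary, probs, max_len=1):
        """One EM pass: forward-backward over each word-pronunciation DAG,
        accumulating graphone posteriors (Eqs. 2.2-2.4), then renormalizing
        (Eq. 2.5). `dictionary` holds (letter string, phoneme tuple) pairs;
        `probs` maps (letters, phones) graphones to current probabilities."""
        evidence = defaultdict(float)
        for letters, phones in dictionary:
            M, N = len(letters), len(phones)

            def arcs_into(m, n):
                # All arcs (source node, graphone) ending at node (m, n).
                for dl in range(max_len + 1):
                    for dp in range(max_len + 1):
                        if (dl, dp) == (0, 0) or dl > m or dp > n:
                            continue
                        g = (letters[m - dl:m], phones[n - dp:n])
                        if g in probs:
                            yield (m - dl, n - dp), g

            # Forward pass (Eq. 2.2), visiting nodes in topological order.
            alpha = {(0, 0): 1.0}
            for m in range(M + 1):
                for n in range(N + 1):
                    if (m, n) != (0, 0):
                        alpha[(m, n)] = sum(alpha[src] * probs[g]
                                            for src, g in arcs_into(m, n))
            # Backward pass (Eq. 2.3), distributing scores to arc sources.
            beta = defaultdict(float)
            beta[(M, N)] = 1.0
            for m in range(M, -1, -1):
                for n in range(N, -1, -1):
                    for src, g in arcs_into(m, n):
                        beta[src] += probs[g] * beta[(m, n)]
            # Accumulate arc posteriors (Eq. 2.4).
            if alpha[(M, N)] > 0:
                for m in range(M + 1):
                    for n in range(N + 1):
                        for src, g in arcs_into(m, n):
                            evidence[g] += (alpha[src] * probs[g]
                                            * beta[(m, n)] / alpha[(M, N)])
        # Renormalize so the graphone probabilities sum to one (Eq. 2.5).
        total = sum(evidence.values()) or 1.0
        return {g: ev / total for g, ev in evidence.items()}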
For the second step of the graphone training process, the Viterbi algorithm is
applied to the DAGs to segment the entire training dictionary into sequences of
graphones. Running this algorithm on a word-pronunciation pair involves calculating
the Viterbi score for each node in its DAG and setting back-pointers, which can then
be used to perform a backward trace to identify the graphones corresponding to the
best segmentation. The Viterbi scores, $\gamma^*(t_{m,n})$, are updated while moving from the
upper-left to the lower-right corner of the DAG. This score represents the probability
of the best graphone segmentation up to node $t_{m,n}$. It can be calculated using the
following equation:

$$\gamma^*(t_{m,n}) = \max_{\{X_{m,n}\}} p(X_{m,n})\, \gamma^*(\bar{x}_{m,n}) \qquad (2.6)$$

The back-pointer for node $t_{m,n}$ is set to point to the previous node that corresponds
to the maximum value in Equation 2.6.
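The segmentation step can be sketched in the same style (again an illustration, reusing the DAG conventions and assumed data layout of the EM sketch above):

    def viterbi_segment(letters, phones, probs, max_len=1):
        """Most likely graphone segmentation of one word-pronunciation pair:
        fill in Viterbi scores and back-pointers over the DAG (Eq. 2.6),
        then trace back from the lower-right corner."""
        M, N = len(letters), len(phones)
        best = {(0, 0): (1.0, None, None)}  # node -> (score, prev node, graphone)
        for m in range(M + 1):
            for n in range(N + 1):
                if (m, n) == (0, 0):
                    continue
                candidates = []
                for dl in range(max_len + 1):
                    for dp in range(max_len + 1):
                        if (dl, dp) == (0, 0) or dl > m or dp > n:
                            continue
                        g = (letters[m - dl:m], phones[n - dp:n])
                        src = (m - dl, n - dp)
                        if g in probs and src in best:
                            candidates.append((best[src][0] * probs[g], src, g))
                if candidates:
                    best[(m, n)] = max(candidates, key=lambda c: c[0])
        if (M, N) not in best:
            return None  # no segmentation possible under this graphone set
        node, segmentation = (M, N), []
        while node != (0, 0):  # backward trace via back-pointers
            _, prev, g = best[node]
            segmentation.append(g)
            node = prev
        return list(reversed(segmentation))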
For the third step, standard N-gram modeling is used to compute statistics over the
segmented training dictionary, including smoothing and backing off to shorter
N-grams. These statistics can then be combined with standard search algorithms
(e.g., Viterbi search, beam search, Dijkstra's algorithm) to decode an unknown word into
its most likely pronunciation.
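As a minimal illustration of this step, raw bigram counts over the segmented dictionary could be collected as follows (smoothing and back-off are omitted, and the boundary markers are my own convention):

    from collections import Counter

    def graphone_bigram_counts(segmentations):
        """Raw bigram counts over graphone sequences, with boundary markers.
        Real systems add smoothing and back-off (e.g., Kneser-Ney)."""
        counts = Counter()
        for seg in segmentations:
            seq = ["<s>"] + list(seg) + ["</s>"]
            counts.update(zip(seq, seq[1:]))
        return counts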
In 2002, Bisani and Ney improve upon Deligne's formulation by adding an effective
trimming method during training for controlling the number of graphones [6]. This is
important because only a subset of the graphones that satisfy the letter/phoneme size
constraints are actually valid chunks for modeling the training data. The intuition
behind this is closely related to choosing morphemes that best represent the training
data. Because only some of all possible chunks resemble valid morphemes or syllables,
trimming can help us remove spurious graphones from our set. This study explores
various constraints for the number of letters or phonemes allowed in a graphone, and
finds that allowing 1-2 letters and 1-2 phones per graphone, combined with a
trigram language model gives the best performance in terms of phoneme error rate
(PER). This trimming method, called evidence trimming, trims graphones based on
how much they are used in segmenting word-pronunciation pairs in the dictionary,
and is slightly different from trimming graphone probabilities. For a given iteration of
EM, the evidence for a graphone refers to its accumulated posterior probability over
all word-pronunciation DAGs. The graphone set is trimmed by applying a threshold
for the minimum amount of evidence needed for a graphone to remain in the model.
The probability mass of the trimmed graphones is then redistributed over the remaining
graphones. Note that this is actually the opposite of the intuition from a language
modeling point of view, where we use smoothing to assign probabilities to unseen
data.
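A minimal sketch of evidence trimming as just described (the data layout is assumed; the threshold value is a free parameter):

    def trim_by_evidence(evidence, threshold):
        """Drop graphones whose accumulated posterior mass (evidence) falls
        below the threshold, then renormalize so the survivors sum to one."""
        kept = {g: ev for g, ev in evidence.items() if ev >= threshold}
        total = sum(kept.values())
        return {g: ev / total for g, ev in kept.items()}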
Both Bisani and Ney and Deligne et al. consider Viterbi training rather than EM training as a way to speed up the training process. During Viterbi training, only the posterior
probabilities of graphones along the best path in each DAG are accumulated. This
could be a good approximation to considering all paths. However, both studies find
that Viterbi training is highly sensitive to the initialization of graphone probabilities.
One good initialization method is to initialize graphone probabilities proportional to
each graphone's occurrence in all possible segmentations of the training dictionary.
In subsequent studies, Bisani and Ney investigate the use of graphones in large vo-
cabulary continuous speech recognition (LVCSR) [7]. Several studies by other authors
also apply graphones to tasks such as dictionary verification [40]. Because graphone
models are trained probabilistically, they are good for spotting errors in manually-annotated dictionaries. For an entry that is likely to be erroneous, the probability
associated with its graphone segmentation is very low compared to the others. Using this
information, we can automatically detect mislabeled entries in the dictionary.
In a later study, Bisani and Ney revisit the L2S task with a revised
training procedure that integrates EM training with language modeling (LM) [9],
and produce results that are better than or comparable to other well-known L2S
methods. EM training is first used to determine the set of graphone unigrams, in
almost the same fashion as before. Then, the next-highest N-gram model (i.e. bigram)
is initialized with the current model, and trained to convergence using EM. This
process is repeated for successively-higher N-grams until the desired N-gram length
is achieved. This differs from their previous study [6] in that EM is now performed
for each N-gram length, rather than estimating the final language model directly on
a dictionary segmented by unigram graphones. In the higher-order N-gram models,
each node in the dictionary entry's DAG represents a position along the letter-sound
sequence as well as the history for the N-gram that ends at this position. This means
that unlike the older model described by Figure 2-1, the underlying probabilistic
model is no longer a memoryless emitter. Instead, graphone emission probabilities
are now conditioned on the previous N - 1 emissions. Therefore, a node in the
DAG is only connected to previous nodes that satisfy the N-gram history constraint.
Each graphone N-gram is associated with an evidence value that is accumulated
through all DAGs during each iteration of EM. Kneser-Ney smoothing is applied to
graphone N-gram evidence values at the end of each EM iteration. Because these
counts are fractional, we cannot use discount parameter selection methods that assume
integer counts. Powell's method is used to tune the smoothing discount parameters by
optimizing on a held-out set. When a graphone N-gram's fractional count falls below
the discount parameter value, this N-gram is effectively trimmed. Another difference
between this study and their previous study is that the new EM convergence criterion
is the likelihood of a held-out set rather than the training set, which could increase
the generalizability of the L2S converter. This study concludes that training a long
N-gram (i.e. 8 or 9-gram) along with tiny graphones (i.e. 0-1 letters/phonemes)
produced the best L2S results. This finding is similar to the results produced by
Chen with joint maximum entropy models [11], which is expected because the two
methods have similar statistical learning processes. The results of this study are also
comparable to the discriminative training results reported in [27].
2.2.2
Graphone Flat Hybrid Recognition
In speech recognition, graphones can be an effective method to mitigate the OOV
problem in open vocabulary speech recognition. Ideally, OOV words would be recognized as sequences of graphones that can be concatenated to form the correct
spellings. There may be cases where an in-vocabulary word is recognized as a series
of graphones, but as long as the graphone sequence still generates the correct spelling,
the speech recognizer's output is still accurate. One way to move closer to this goal
is to use a flat hybrid model for the speech recognizer's language model. To build
this model, OOV words in the recognizer's training corpus are replaced by their graphone representations. Finding the graphone representation of a word can be done
through the same process as L2S conversion, except we stop just before converting
the most-likely graphone sequence to a phoneme sequence. From now on, we will
refer to this process as graphonization. The only other part of the recognizer that
needs to be modified is the lexicon, which needs to be augmented with a list of graphones and their pronunciations. This model is flat in the sense that graphones and
words are treated as equals in the language model; the model is a hybrid because it
contains statistics of N-gram transitions between graphones and words. Bisani and
Ney have applied graphones in flat hybrid recognition for the Wall Street Journal
dictation task [8]. Their graphonization process uses an N-gram size of 3, and their
speech recognizer's language model also has an N-gram size of 3. They find that at
lower vocabulary coverage (88.8% and 97.4%), the word and letter error rates are
significantly reduced by using this flat hybrid model. At high vocabulary coverage
(99.5%), this model does not degrade the recognition performance even though the
added graphones could increase confusability with words. They also note tradeoffs
regarding the graphone size. Larger graphones are easier to recognize correctly, but
have worse L2S performance, leading to less accurate graphonization of OOV words
in the training corpus. Small graphone sizes (i.e. 1-2 letters/phonemes) can also
lead to more insertions in recognition output. In their experiments, they find that
the trade-off point is a graphone size of 4 letters/phonemes for 88.8% and 97.4%
coverage, and a graphone size of 3 for 99.5%.
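To make the flat hybrid corpus preparation concrete, the following sketch replaces OOV words in an LM training corpus with graphone units (the `graphonize` function and the unit representation are hypothetical stand-ins for the actual graphonization pipeline):

    def build_hybrid_corpus(sentences, vocab, graphonize):
        """Replace each OOV word in the LM training corpus by its graphone
        sequence; in-vocabulary words pass through unchanged. The resulting
        corpus is used to train a flat hybrid N-gram language model.
        `graphonize(word)` returns the word's most likely graphone sequence."""
        hybrid = []
        for sentence in sentences:
            units = []
            for word in sentence.split():
                units.extend([word] if word in vocab else graphonize(word))
            hybrid.append(" ".join(units))
        return hybrid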
Recently, Bisani and Ney's flat hybrid recognition approach has been combined with other
methods. Vertanen combines graphone hybrid recognition with confusion networks
and demonstrates improvement with the addition of the confusion network component [39]. This study also notes a technique for mitigating the tradeoff where bigger
graphones are not as good in graphonization performance as smaller graphones. Instead of graphonizing the language model's OOV words directly, they first use a tiny-graphone,
large-N-gram L2S model to generate pronunciations for OOV words, and
then graphonize OOV words using bigger graphones with the L2S pronunciations as
an additional constraint. Akbacak et al. use graphones to model OOV words in a
spoken term detection task [1]. The recent applications of the graphone flat hybrid
model show that this model is a state-of-the-art technique for speech recognition in
domains where OOV words frequently occur.
2.3
SUMMIT Speech Recognition Framework
The SUMMIT Speech Recognizer is a landmark-based speech recognizer that uses
a probabilistic framework to decode utterances into transcriptions [21]. The speech
recognizer is implemented as a weighted finite-state transducer (FST), which consists
of four stages: C ∘ P ∘ L ∘ G. C and P encode phone label context and phoneme
sequences. L is the mapping from phonemes to words, and G is the grammar or
language model (LM). The acoustic model is based on diagonal Gaussian mixtures,
and is represented by features derived from 14 Mel-Frequency Cepstral Coefficients
(MFCCs), averaged over 8 temporal regions. The mapping from phonemes to words
is constructed directly from a pronunciation lexicon. The language model is an N-gram model trained on a large corpus of text, which is usually in the same domain
as where the recognizer is used.
Decoding during recognition is performed through a two-pass search process. The
first pass is a forward beam search where traversed nodes are labeled with their Viterbi
scores. The second pass is a backward A* search where the Viterbi scores are used as
the heuristic for the remaining distance to the destination. A wider search beam can
produce more accurate results, but takes longer to run. To minimize computational
complexity, the forward pass usually uses a lower-order N-gram than the backward
pass. The backward A* search produces the recognizer's N-best hypotheses, along
with their associated probability scores.
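The following sketch reduces this two-pass idea to an abstract DAG (an illustration only, not SUMMIT's implementation): a forward Viterbi pass scores every node, and a backward A* pass then uses those scores as an exact heuristic for the remaining distance while extracting the N-best paths.

    import heapq
    import itertools

    def nbest_two_pass(nodes, arcs, start, goal, n):
        """N-best paths through a DAG whose arc weights are log-probabilities.
        `nodes` must be in topological order; `arcs` maps a node to a list of
        (next_node, log_prob, label) triples."""
        # Pass 1: forward Viterbi -- best log-prob from `start` to each node.
        viterbi = {node: float("-inf") for node in nodes}
        viterbi[start] = 0.0
        reverse_arcs = {node: [] for node in nodes}
        for node in nodes:
            for nxt, logp, label in arcs.get(node, []):
                viterbi[nxt] = max(viterbi[nxt], viterbi[node] + logp)
                reverse_arcs[nxt].append((node, logp, label))
        # Pass 2: backward A* from `goal`; the forward Viterbi score is an
        # exact heuristic for the best completion back to `start`.
        tie = itertools.count()
        frontier = [(-viterbi[goal], next(tie), goal, 0.0, [])]
        results = []
        while frontier and len(results) < n:
            _, _, node, backward, labels = heapq.heappop(frontier)
            if node == start:
                results.append((backward, labels[::-1]))  # forward word order
                continue
            for prev, logp, label in reverse_arcs[node]:
                score = backward + logp
                heapq.heappush(frontier, (-(score + viterbi[prev]), next(tie),
                                          prev, score, labels + [label]))
        return results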
Chapter 3
Restaurant and Street Name
Recognition
This chapter focuses on applying graphones to letter-to-sound (L2S) conversion of
restaurant and street names. Since the end goal of L2S in a speech recognition system
is to be able to correctly recognize the corresponding words, we evaluate the quality
of the automatically-generated pronunciations through word recognition experiments.
The objectives of these experiments are to find the optimal training parameters for the
graphone model, determine how using multiple pronunciations affects performance,
and verify that the graphone model generalizes well to truly unseen proper names.
3.1
Restaurant Data Set
The data set used in this experiment was originally collected for research reported in
[13], in which subjects were asked to read 1,992 names in isolation. These names are
selected from restaurant and street names in Massachusetts. They are representative
of a data set that could be used for a voice-controlled restaurant guide application.
These names are especially interesting because a significant portion of them are not
in a typical pronunciation dictionary, so in order for the word recognizer to recognize
these names, we must be able to automatically determine their pronunciations. In a
typical system, as new place names are imported into the system through an update
process, an L2S converter can be run to automatically generate pronunciations for
those words.
3.2 Methodology
The first step involves building graphone language models using the expectation-maximization (EM) training process described in [9]. An established ~150k-word lexicon
used by SUMMIT is used for the training process. Different graphone models with
maximum graphone size (L) of 1-4 letters/phonemes and graphone N-gram size (N)
of 1-8 are explored. L is the upper bound on the maximum number of letters as
well as the maximum number of phonemes in a graphone. There is no lower limit on the
number of letters or phonemes, except that we do not allow graphones that
have neither letters nor phonemes (i.e., the null graphone). Note that it is important to allow 0-phoneme graphones because some letters can be silent. Also, for large-L models, the
number of N-grams may stop changing after a certain N, and we do not continue
building higher-order models beyond this point. These models are saved to files in
ARPA language model format, which are then used to build L2S converters using the
MIT Finite-State Transducer (FST) Toolkit [26].
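For intuition about what the EM training operates over, the following sketch enumerates every joint segmentation of a (spelling, pronunciation) pair into graphones with at most L letters and at most L phonemes each, disallowing the null graphone; EM assigns expected counts over exactly this space of segmentations. The example word is arbitrary.

```python
from functools import lru_cache

L = 2  # maximum letters and maximum phonemes per graphone

def segmentations(letters, phones):
    """All ways to split a spelling and its pronunciation into one sequence
    of graphones (letter chunk, phoneme chunk), each chunk of length 0..L,
    with the all-empty graphone disallowed."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if i == len(letters) and j == len(phones):
            return [[]]
        out = []
        for dl in range(0, L + 1):
            for dp in range(0, L + 1):
                if dl == 0 and dp == 0:
                    continue  # no null graphone
                if i + dl > len(letters) or j + dp > len(phones):
                    continue
                g = (letters[i:i + dl], tuple(phones[j:j + dp]))
                out.extend([g] + rest for rest in rec(i + dl, j + dp))
        return out
    return rec(0, 0)

# A silent final 'e' is covered by a 0-phoneme graphone such as ("e", ()).
for seg in segmentations("ate", ("ey", "t"))[:5]:
    print(seg)
```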
The L2S converter is the composition of three FSTs: letters to graphones, the graphone
language model, and graphones to phonemes. The letter-to-graphone FST maps a letter sequence to all graphone sequences whose letter components match it. The graphone language
model component is a direct FST translation of the ARPA language
model file. The graphone-to-phoneme FST concatenates the phoneme components of
the graphones. For a given word, a Viterbi forward search and an A* backward search
within the L2S FST are used to generate an N-best list of possible pronunciations.
This L2S converter is then used to generate pronunciations for all 1,992 restaurant
and street names.
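Functionally, the cascade behaves like the following sketch: a Viterbi search over graphone segmentations of the input spelling under a unigram graphone model, reading off the concatenated phoneme components. The graphone inventory and probabilities here are toy values; a real converter uses an N-gram model and returns an N-best list rather than the single best path.

```python
import math

# Toy unigram graphone model: (letters, phonemes) -> probability.
# These entries and probabilities are illustrative only.
model = {
    ("c", ("k",)): 0.02, ("a", ("ae",)): 0.03, ("t", ("t",)): 0.04,
    ("a", ("ax",)): 0.02, ("at", ("ae", "t")): 0.01, ("e", ()): 0.02,
}
MAX_LETTERS = 2

def l2s(word):
    """Viterbi over segmentations: best[i] = (cost, phonemes) for word[:i]."""
    best = {0: (0.0, ())}
    for i in range(1, len(word) + 1):
        for k in range(1, MAX_LETTERS + 1):
            if i - k < 0 or (i - k) not in best:
                continue
            chunk = word[i - k:i]
            for (letters, phones), p in model.items():
                if letters != chunk:
                    continue
                cost = best[i - k][0] - math.log(p)
                if i not in best or cost < best[i][0]:
                    best[i] = (cost, best[i - k][1] + phones)
    return best.get(len(word))

print(l2s("cate"))  # -> (cost, ('k', 'ae', 't')), with a silent final 'e'
```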
For evaluation, a word recognizer is built using the automatically generated pronunciations. A standard set of telephone-quality acoustic models is used [43]. Utterance data collected from users is then used to determine the word error rate (WER)
of this recognizer. These results are also compared to the same experiments using
spellnemes.
3.3 Results and Discussion

3.3.1 Graphone Parameter Selection
Graphone parameters have a big impact on the performance of the system. Bigger
graphones with larger L values can capture more information individually, but each
word contains fewer graphones of this size. Smaller graphones, on the other hand,
capture less information individually, but because each word contains more graphones,
we can more effectively use established language modeling techniques to make the
model generalizable. Bigger graphones also lead to a much larger set of graphones
compared to smaller graphones.
Graphone models with different L and N parameters were used to generate word
pronunciations. The raw data for graphs in this section are located in Appendix A.
Figure 3-1 shows the recognition word error rate (WER) for different combinations
of L and N values. For lower N values, high-L models perform the best because
low-L graphones capture little information individually. For high N values, low-L models perform better, most likely because the language modeling framework is
more powerful and generalizable than the EM framework that identifies the graphone
set. One explanation for this could be that, unlike the Kneser-Ney discounting used
with N-grams, there is no discounting that trades larger graphones off in favor of the smaller
graphones they subsume. Numerical precision may also play a role. For large L, each graphone
unigram has a very small probability, making it likely that some valuable graphones
could be discarded when their evidence falls below numerical precision. For a sense
of comparison, graphone set sizes for the unigram models are 0.38k, 3.7k, 20k, and 54k
for L = 1, 2, 3, 4, respectively. Numerical precision could also play a role for higher-order N-grams, but because the training process trains each N-gram model
to convergence before ramping up to the next-higher N-gram, we have much more
model stability as we increase N.

Figure 3-1: Word recognition performance using automatically generated pronunciations produced by a graphone L2S converter trained with various L and N values. (WER (%) vs. graphone N-gram size, one curve per L value from 1 to 4.)
These results show that the best L-N combination is L = 1 and N = 6, which
has a WER of 35.2%. For L = 1, as we increase N past 6, the WER actually starts
to increase slightly. This could be due to over-training as we increase the model order
too much. For L = 2, WER does not decrease beyond a 5-gram. This trade-off point
is around a trigram for L = 3, and a bigram for L = 4.

Comparing graphone results with spellneme results from [13] shows that the best
graphone models outperform spellneme models. With spellnemes, the WER is 37.1%,
while the best graphone model yields a WER of 35.2%. This could be because for
graphones, EM training with smoothing is run for both unigrams and higher-order N-grams. For spellnemes, however, although N-gram statistics are computed
on a subword-segmented dictionary, there is no additional learning process for readjusting the higher-order N-gram models.
Word         First Hypothesis         Second Hypothesis        Third Hypothesis
albertos     aa l b eh r t ow z       ae l b er t ow z         ae l b er t ow s
creperie     k r ey pd r iy           k r ax p r iy            k r ax p er iy
giacomos     jh ax k ow m ow z        jh ax k ow m ow s        jh ax k ow m ax s
isabellas    ih z ax b eh l ax z      ax s aa b eh l ax z      ih z ax b eh l ax s
renees       r ax n ey z              r ax iy z                r ax n iy z
sukiyummi    s uw k iy uw m iy        s uw k iy y uw m iy      s uw k iy ax m iy
szechuan     s eh ch w aa n           s eh ch uw aa n          s eh sh w aa n
tandoori     t ae n d ao r iy         t ae n d ao r ay         t ae n d ao r ax
valverde     v aa l v eh r df ey      v aa l v eh r df iy      v ae l v eh r df ey

Table 3.1: Examples of the L2S converter's top 3 hypotheses using the graphone
language model.
3.3.2 Adding Alternative Pronunciations
The pronunciation dictionary can have multiple pronunciations for each word. In
fact, a manually-annotated dictionary will definitely have multiple pronunciations for
some words because words can be pronounced in different ways. For an automatically-generated dictionary, we could just use the top hypothesis from the L2S converter as
the pronunciation. However, the next few hypotheses in the converter's output N-best list (ranked by decreasing probability) could be valuable as well, especially if a
word truly does have multiple pronunciations. Interestingly, for restaurant and street
names, multiple pronunciations can have another added benefit: people may pronounce names incorrectly because they are guessing based on prior language knowledge. For example, for a French restaurant name, some people may pronounce it
closer to its French pronunciation, while others may say it by reading the name as
if it were an English word. If the L2S converter suggests some of these commonly-guessed pronunciations as alternatives, then we are more likely to recognize place
names correctly, even if the subject doesn't pronounce it correctly in the first place.
This all works under the assumption that the L2S converter's top hypotheses are the
ones that make the most linguistic sense based on the training dictionary. For the
graphone-based L2S, this is usually the case when the training dictionary is large.
Table 3.1 shows some examples of graphone L2S conversion using L = 1 and N = 6.
Figure 3-2: Word recognition performance using the top 2 pronunciations generated with various L and N values. (WER (%) vs. graphone N-gram size, one curve per L value from 1 to 4.)
For most of the words, the first hypothesis is an excellent pronunciation. Sometimes,
the quality of the pronunciation degrades rapidly, such as for renees, but for most
words, the first few hypotheses are all reasonable approximations of the true pronunciation. Some of these variations account for variations in phonemes that are very
close to each other, such as s and z, or ax and ae. These variations are also useful
for capturing how fast the person says the word. For creperie, the first two pronunciations are
more accurate for someone saying it at conversational speed, especially if they know
French, whereas the third pronunciation is a more enunciated form.
First, we explore various values of L and N for generating the top 2 pronunciations
for each word. The WER graph, shown in Figure 3-2, has a similar shape to the
graph for using the top pronunciation only. These results are slightly better than using
the top 1 pronunciation. The best L-N combination is still L = 1 and N = 6, with
a WER of 31.6%. Since for both top 1 and top 2 pronunciation cases, the same L-N
combination gives the best results, we use this combination for experimenting with
even longer lists of alternative pronunciations.
To study the effect of adding more alternative pronunciations, the number of pronunciations per word is varied from 1 to 50 for L = 1 and N = 6. Multiple
pronunciations for each word are weighted equally in the recognizer's FST framework.
We expect that as we increase the number of pronunciations per word, the word recognition performance first gets better, because we are adding more useful variation
for each word, and then gets worse, because too many pronunciations introduce too
much confusion into the recognition framework and dilute the probabilities for the
correct pronunciations. These results, along with a comparison to spellnemes, are
shown in Figure 3-3. We see the most recognition improvement when the number
of pronunciations is increased from 1 to 3. We then have a trade-off point of about
10 pronunciations per word before the WER starts increasing with added pronunciations. These results indicate that graphones are most advantageous relative to
spellnemes when using the top 1 hypothesis. Spellnemes do better at higher numbers
of included pronunciations, possibly because their lexically-derived nature allows the
pronunciation quality to degrade more gradually as we move down the N-best hypothesis list. In addition, because spellnemes are more linguistically-constrained, they may
be better at generating alternative pronunciations when applying phonological rules
leads to multiple correct pronunciations that are distant from each other.
3.3.3 Generalizability of Graphone Models
Finally, we explore the generalizability of graphone L2S for this task. The restaurant
and street name data set includes some common English names that are likely to
occur in the training dictionary. The set of truly unseen names provides a good test
for the generalizability of the model. Out of the 1,992 names, 626 have an exact
match in the dictionary used to train the graphone model, and the remaining 1,366
are considered to be unseen. In order to generate good pronunciations for unseen
words, the L2S converter must use the probabilities learned from the dictionary to
guess the pronunciations of the unseen words. Also, this task is more challenging than
evaluating L2S performance by dividing a dictionary randomly into train and test
sets, because the unseen words for this task are names of places.
Figure 3-3: Word recognition performance with various numbers of pronunciations per word, generated using the graphone (L = 1, N = 6) and spellneme L2S converters. (WER (%) vs. number of pronunciations per word.)
Some examples of words in the unseen set are: galaxia, isabelles, knowfat, verdones, and republique.
Many of these words are foreign words, concatenations of multiple words, or purposely-misspelled words.
For using the top pronunciation only, WER results separated by seen and unseen
words are shown in Figures 3-4 and 3-5. For unseen words, we see that we are still
getting very good performance that is sometimes even slightly better than for seen
words. One reason for this favorable performance is that the EM training process has
integrated smoothing. For unseen words, we also see a large dip at L = 1 and N = 6,
which indicates possible overtraining after the N-gram length goes past 6. This dip
is not present for the seen words.
For using the top 2 pronunciations, WER results for seen and unseen words are
shown in Figures 3-6 and 3-7. It's interesting to see that both plots have dips in their WER
curves that indicate overtraining for very high N-grams. One explanation for this is
that many seen words in the training dictionary only have one pronunciation. While
the top L2S pronunciation may match the actual seen pronunciation exactly, the
Figure 3-4: Word recognition performance for seen words when the top 1 pronunciation is used.
Figure 3-5: Word recognition performance for unseen words when the top 1 pronunciation is used.
Figure 3-6: Word recognition performance for seen words when the top 2 pronunciations are used.
Figure 3-7: Word recognition performance for unseen words when the top 2 pronunciations are used.
second L2S pronunciation is only a hypothesized guess since it is not available in the
training data. Finally, we have also shown good generalizability to unseen words for
top 2 pronunciations, since the curves are similar to the ones for seen words.
3.4 Summary
In this chapter, we built a system that recognizes restaurant and street names, and
demonstrated that graphone models are excellent for L2S conversion of these words.
We further increased the word recognition performance of the system by incorporating more than one pronunciation for each word, using N-best hypotheses from the
L2S converter. Finally, we showed that graphones generalize well to unseen words
because their performance does not deteriorate for words not in the graphone training
dictionary.
Chapter 4
Lyrics Hybrid Recognition
This chapter discusses applying graphones to hybrid recognition in the lyrics domain.
Song lyrics often have many terms that are not actually English words. This provides
a good environment to evaluate the performance of hybrid recognition using graphone
language models. Graphone-based hybrid recognizers are built for various vocabulary
coverage levels, and evaluated on spoken lyrics data. The results of this chapter are
applicable for building a recognizer that could be integrated into a mobile or in-car
entertainment system that allows users to query for songs by speaking their lyrics.
The system constructed in this chapter also provides a foundation for moving to a
truly open vocabulary task in the lecture domain, as described in the next chapter.
4.1 Lyrics Data Set
The lyrics data was collected in a previous experiment that focused on hybrid recognition using spellnemes, as well as querying songs by lyrics [12]. The data collection
process involved 20 subjects (13 males and 7 females), who were presented with 30-second clips of songs. They were instructed to speak any segment of the lyrics for
that song based on what they heard. The subjects were also asked to type what they
said, which serves as the reference for speech recognition experiments. A total of 1k
songs were used in the data collection process. These songs were chosen from a set
of ~37k songs while ensuring that it was not too difficult to recognize the lyrics from
listening to these songs. Lyrics for these songs were scraped from www.lyricwiki.org,
and were cleaned by removing non-ASCII characters as well as foreign songs. A foreign-song identification script was written to recognize foreign songs by calculating
the proportion of non-English words in the song, and using a simple threshold as
the classification criterion. The few remaining foreign songs were manually removed
from the data set. Words in the corpus were also cleaned by correcting misspellings and splitting hyphenated words. These utterances can be harder to transcribe
than normal dictation speech because lyrics can contain popular-culture slang terms that
are not used in dictation speech. Using the same data for graphone experiments also
allows us to make some comparisons between spellnemes and graphones for hybrid
recognition.
4.2 Methodology
The lyrics of ~37k songs are used as the language model training corpus for the
recognizer. First, artificial vocabulary coverage levels are generated by ordering all
words in this corpus by decreasing frequency, and then choosing the set of most frequent words that provides the desired vocabulary coverage level (i.e., such that x%
of all corpus tokens, including duplicates, are in the vocabulary). Depending on
the vocabulary coverage, it is possible that some in-vocabulary words do not
occur in the SLS pronunciation dictionary (~150k words), so their pronunciations
are initially unknown. In these cases, one pronunciation per word is generated using
a letter-to-sound (L2S) converter with L = 1 and N = 8, which are generally the
best parameters for L2S conversion [9]. This converter is trained on the SLS dictionary. Out of 46k unique words in the corpus, pronunciations of ~15k words are
automatically generated, and the rest are copied directly from the SLS dictionary.
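The coverage-based vocabulary selection can be sketched as follows; the corpus file name is hypothetical:

```python
from collections import Counter

def vocabulary_for_coverage(tokens, coverage):
    """Smallest frequency-ordered vocabulary whose words cover at least
    `coverage` (e.g. 0.90) of all token occurrences in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab, covered = set(), 0
    for word, c in counts.most_common():
        if covered / total >= coverage:
            break
        vocab.add(word)
        covered += c
    return vocab

tokens = open("lyrics_corpus.txt").read().split()  # hypothetical corpus file
vocab = vocabulary_for_coverage(tokens, 0.90)
print(len(vocab), "words cover 90% of the training tokens")
```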
The next step involves segmenting out-of-vocabulary (OOV) words from the corpus into graphones. This is done by training a letter-to-graphone (L2G) converter,
with graphone size L = 4, using a 5-gram, on the SLS dictionary. This converter is
almost the same as the letter-to-sound (L2S) converter, except we do not include the
FST that converts graphones into phonemes.

Vocabulary Size    Training Coverage    Test Coverage
68                 50.0%                53.0%
120                60.0%                63.5%
233                70.0%                74.3%
492                80.0%                84.5%
1443               90.0%                93.4%
1942               92.0%                94.9%
2766               94.0%                96.4%
4386               96.0%                97.7%
8572               98.0%                99.0%
14532              99.0%                99.4%

Table 4.1: Vocabulary size of hybrid recognizers, and their vocabulary coverage levels
on the training and test sets.

OOV words in the training corpus are
replaced by their L2G output. The language model for the recognizer is then trained
on the resulting flat hybrid corpus. The set of L = 4 graphones is also added to
the recognizer's vocabulary. The rationale for choosing L = 4 is that this has yielded
the best results in Bisani and Ney's hybrid Wall Street Journal recognition task [8].
We have also conducted preliminary experiments with L = 3 and L = 4, which showed that
L = 4 produces better results. The acoustic models for these recognizers are trained
on telephone speech [43].
Several hybrid recognizers are built with training set vocabulary coverage levels
ranging from 50% to 99% (see Table 4.1). These recognizers are evaluated on the 1k
utterances from the lyrics data set through two methods: replacing each graphone sequence in the recognizer's output hypotheses by an OOV tag, and replacing graphone
sequences by the concatenation of the graphones' letter sequences. Replacing by an
OOV tag gives us an understanding of how graphones can prevent misrecognition of
words adjacent to an OOV term. Concatenating graphones gives us a sense of how
well the graphones spell out OOV words. Word-only recognizers are also trained on
the same vocabulary coverage levels for comparison with hybrid recognizers. Finally,
a word-only recognizer with 100% training corpus coverage is built as well.
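The two ways of scoring graphone output amount to a simple post-processing step over the hypothesis string. In this sketch, graphone tokens are assumed to be written as letters and phonemes joined by '/', which is an illustrative convention rather than the recognizer's actual token format:

```python
def postprocess(hypothesis, mode):
    """Collapse each maximal run of graphone tokens in a hypothesis.
    mode='oov':   replace the run with a single <OOV> tag.
    mode='spell': concatenate the run's letter parts into a word."""
    def is_graphone(tok):
        return "/" in tok  # assumed marking, e.g. "hesi/hh-eh-z-ih"
    out, run = [], []
    for tok in hypothesis.split() + [""]:  # empty sentinel flushes the last run
        if tok and is_graphone(tok):
            run.append(tok)
        else:
            if run:
                out.append("<OOV>" if mode == "oov"
                           else "".join(g.split("/")[0] for g in run))
                run = []
            if tok:
                out.append(tok)
    return " ".join(out)

hyp = "the time to hesi/hh-eh-z-ih tate/t-ey-t"
print(postprocess(hyp, "oov"))    # -> the time to <OOV>
print(postprocess(hyp, "spell"))  # -> the time to hesitate
```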
Since this recognizer is geared toward song retrieval, most of the test utterances
are also in the recognizer's language model training corpus. The only situation where
the test utterance's reference transcription differs from the training corpus is when
the user incorrectly hears the song's lyrics, which did happen during data collection.
This factor causes the training corpus to have a 0.2% OOV rate on the test set. Nevertheless, because most of the phrases in test utterances are in the training data, this
task is mostly a closed-vocabulary task, and should perform better than completely
open-vocabulary tasks.
We evaluate the recognition performance on word error rate (WER), sentence
error rate (SER), and letter error rate (LER). WER tells us how many words we have
exactly correct in our recognition output, but can be overly harsh for words that the
graphones spell almost correctly. Therefore, we also evaluate our results on LER to
reward the recognizer for spelling words close to correct. SER tells us if we get entire
utterances correct, and is the most strict measure of correctness used for this study.
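For concreteness, the three metrics can be computed with a standard Levenshtein alignment, as in this sketch (the example strings are invented; for LER, spaces are mapped to an explicit token so that word-boundary errors are counted, the convention also used in Chapter 5):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def ser(refs, hyps):
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)

def ler(ref, hyp, space_token="_"):
    r = list(ref.replace(" ", space_token))
    h = list(hyp.replace(" ", space_token))
    return edit_distance(r, h) / len(r)

# A near-miss spelling is heavily penalized by WER but only slightly by LER.
print(wer("the time to hesitate", "the time to hesi tate"))  # 0.5
print(ler("the time to hesitate", "the time to hesi tate"))  # 0.05
```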
4.3 Results and Discussion
The results of this experiment show dramatic improvements in combating OOV words
by using a graphone flat hybrid model. The raw data for all graphs in this section is
located in Appendix B. Figure 4-1 shows WER recognition results for three setups.
The first setup is the baseline word-only recognizer that only knows words within the
specified coverage level. We see improvement in WER when we use a hybrid recognizer
and replace graphone sequences in the output by OOV tags. This improvement shows
that graphones are useful as an OOV detector, and help us prevent words around
OOVs from being misrecognized.
Next, we see even more improvement when we
concatenate graphones to spell OOV words. This improvement indicates that in this
task, graphones are effective in hypothesizing pronunciations of OOV words.
We
get the most improvement for low vocabulary levels (50%), where using graphones
can improve WER by almost 40 percentage points. For very high vocabulary (99%),
there are still some performance gains from using the hybrid recognizer. The word-only recognizer with 100% training vocabulary coverage yields a WER of 28.4% (see
Table B.1), which is about equivalent to the performance of a hybrid recognizer with
Figure 4-1: Recognition word error rate (WER) for recognizers with 50-99% training
vocabulary coverage. Word & OOV is replacing each output graphone sequence by an
OOV tag. Word & Graphone is concatenating consecutive graphones in the output
to spell OOV words.
90% coverage. Table 4.2 shows some examples of hybrid recognition output at the 92%
vocabulary coverage level.
The sentence error rates (SER) for the three approaches are shown in Figure 4-2.
Here, replacing graphones by the OOV tag performs about the same as word-only
recognition, except at very high vocabulary coverage. This means that at low vocabulary coverages, there are just too many words that are OOV to see any improvement
from using graphones. However, at very high vocabulary coverage, some in-vocabulary
words that would have been misrecognized by the word-only recognizer, possibly because they have low probability, are actually recognized correctly as entire words
when the graphone model is added. We also see dramatic improvement when we
concatenate graphones to spell OOV words, which allows us to correctly recognize
some sentences with OOV words. Also, the 100% coverage word-only recognizer has
a SER of 67.5% (see Table B.1).
Finally, letter error rates (LER) are computed for these experiments (see Figure 4-3).
Word-only:  the tide at the gate
Hybrid:     the time to [hesi|tate]
Reference:  the time to hesitate

Word-only:  tonight this could you send until the night
Hybrid:     [tenn|ant|cisc|o] days san [fran|cisc|o] nights
Reference:  san francisco days san francisco nights

Word-only:  in the l bum of my memory
Hybrid:     in the [alb|um] of my memory
Reference:  in the album of my memory

Word-only:  i can see you water and moonlight be missing
Hybrid:     i can see you water and moonlight [beam|ing]
Reference:  i can see water and moonlight beaming

Word-only:  she ring cloud it in crowd
Hybrid:     [ch|eer|ing] clouds in [glit|ter|ing] [crow|ds]
Reference:  shimmering clouds glittering crowds

Word-only:  a long story short and your situation
Hybrid:     a [con|stan|t|source] of your [fru|stra|tion]
Reference:  a constant source of your frustration

Table 4.2: Example recognizer output at the 92% vocabulary coverage level. Square
brackets denote concatenated graphones (only letter part shown). Vertical bars are
graphone boundaries.
Figure 4-2: Recognition sentence error rate (SER) for recognizers with 50-99% training vocabulary coverage. Word & OOV is replacing each output graphone sequence
by an OOV tag. Word & Graphone is concatenating consecutive graphones in the
output to spell OOV words.
Figure 4-3: Recognition letter error rate (LER) for recognizers with 50-99% training
vocabulary coverage. Word & Graphone is concatenating consecutive graphones in
the output to spell OOV words.
On this plot, we do not show results for replacing by the OOV tag because this
approach actually doesn't make sense with LER; we do not know how many letters
the OOV tag truly encapsulates. For LER, we also see dramatic improvements in
how much graphones can help spell out words correctly. Because the curve for using
graphones is much flatter than the graph for WER, we can conclude that many words
at low vocabulary coverages are actually spelled very close to their true spellings. The
100% word-only recognizer has a LER of 18.5% (see Table B.1), which is similar to
the LER performance of the 80% hybrid recognizer with only a 492-word vocabulary.
We can also compare these graphone results with the spellneme hybrid results
reported in [12]. In the spellneme study, a hybrid recognizer is built by converting
OOV words to spellnemes, just as in graphone hybrid recognition. Although WER
and SER results are reported for word-only and replacing spellneme sequences by
an OOV tag, no results are reported for concatenating spellnemes together.
For
the replacing subwords by OOV approach, graphones perform better by about 10
percentage points for low vocabulary coverage. This advantage gradually decreases
as we increase the vocabulary coverage, and at very high vocabulary coverage, the
two methods perform about the same in WER. For SER of replacing by the OOV
tag, spellnemes perform about 10 percentage points better than graphones at low
vocabulary coverages, and perform about the same as graphones at high vocabulary
coverage. One reason could be that at very low vocabulary coverages, graphones
are more eager to appear at places where an in-vocabulary word should appear, but
may still end up with the same spelling anyway. However, these sentences are still
penalized because they contain OOV tags. Therefore, WER is a better measure of
performance than SER for this task.
4.4 Controlling the Number of Graphones
Controlling the size of the graphone set can be important for applications where the
size of the recognizer's lexicon is limited. Because graphones are added directly to
the recognizer's lexicon, they can also increase the size of the resulting recognizer.
With the untrimmed L = 4 graphone model, 22k graphones are needed to segment
all words in the training corpus. The L2G model used for this segmentation has 54k
unique graphones. Thus for the 50% coverage vocabulary, although only 68 words
are in-vocabulary, almost 22k graphones are needed to model the OOV words. For
high vocabulary coverages, the number of graphones is considerably smaller because
fewer words are considered OOV in the training corpus. Nevertheless, it is beneficial
to investigate trimming the graphone set.
The size of the graphone set is reduced by modifying the procedure for training
graphones on the pronunciation dictionary. First, the unigram EM is run to convergence with the Kneser-Ney evidence discount parameter automatically chosen by
Powell's method. For L = 4, this discount value is 0.07. We then run the unigram EM
again with a fixed discount value to trim the graphone set to the desired size. For
a sense of comparison, graphone evidence values accumulated over the entire training dictionary can be as big as 5k to 10k for common graphones like (s/s), (er/er),
and (ing/ih ng), and as small as allowable by machine precision. Finally, we use the
trimmed model to ramp up to higher graphone N-grams using the same procedure as
before. Table 4.3 shows how the size of the graphone set changes as the evidence
discount parameter is varied.

Discount Value    Number of Graphones
0.07              54k
3.0               11k
5.0               7k
8.0               5k
10.0              4k
20.0              2k

Table 4.3: Number of graphones for the L = 4 unigram model after trimming using the
specified Kneser-Ney discount value.

        Full Graphone    4k Graphone    2k Graphone    Word-Only
SER     72.7%            75.0%          78.0%          86.0%
WER     31.6%            34.0%          37.1%          50.1%
LER     18.8%            20.3%          22.5%          32.6%

Table 4.4: Comparison of recognition performance for the full graphone hybrid model
and trimmed graphone models (4k and 2k graphone set size). We also show results
for the word-only recognizer. All of these recognizers have 80% vocabulary coverage.
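The pruning effect of a fixed evidence discount can be sketched roughly as below; the evidence counts are invented, and in the real procedure the discount is applied inside the EM iterations rather than as a single pass:

```python
# Hypothetical accumulated EM evidence for a few graphones.
evidence = {
    ("s", "s"): 9500.0,
    ("er", "er"): 7200.0,
    ("ing", "ih ng"): 5100.0,
    ("xq", "k s"): 0.4,      # rare unit with tiny evidence
    ("zzz", "z"): 1e-12,     # evidence near machine precision
}

def trim(evidence, discount):
    """Keep only graphones whose discounted evidence stays positive;
    larger discounts prune more of the long tail of rare graphones."""
    return {g: c - discount for g, c in evidence.items() if c > discount}

for d in (0.07, 3.0, 20.0):
    print(d, "->", len(trim(evidence, d)), "graphones kept")
```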
For the lyrics experiments, we ramp up the 4k and 2k trimmed unigram graphone models to 5-grams, and then use these models to segment OOV words at 80%
vocabulary coverage. Table 4.4 shows results for concatenating graphones to spell
OOV words. These results indicate that trimming the graphone set does degrade the
hybrid's recognition performance, but the degradation is small compared to the gap between the hybrid and the word-only recognizer. Even after reducing the number of graphones
used in the hybrid model by 10-fold to 2k, we still achieve recognition performance
that is significantly better than word-only recognition.
4.5 Summary
In this chapter, we applied graphone hybrid recognition to music lyrics utterances. We
demonstrated significant performance improvements over the word-only recognizer at
all vocabulary coverage levels tested. By replacing graphone sequences with OOV
tags, we showed that using the hybrid recognizer can reduce errors for in-vocabulary
neighbors of OOV words. By concatenating graphone sequences to spell OOV words,
we improved recognition performance even further, demonstrating that graphones can
correctly spell OOV words. Finally, we explored trimming the graphone set during
EM training, and showed that even with a much smaller graphone set, the hybrid
recognizer still significantly outperforms the word-only recognizer.
Chapter 5
Spoken Lecture Transcription
This chapter explores using graphone hybrid recognition for spoken lecture transcription. This task is an open-vocabulary task because we cannot anticipate everything
the lecturer says in the recorded audio, especially if it's an unknown topic. Our main
objective is to build a hybrid recognizer and compare this recognizer's performance
with a word-only recognizer at various vocabulary coverage levels. We also explore
several extensions for improving the hybrid recognizer, including trimming the graphone set, using a full vocabulary with the hybrid model, and hierarchical models.
5.1 Lecture Data Set
The lecture data is mainly a set of manually-transcribed lectures from the MIT OpenCourseWare and MIT World databases. The data was originally used in [22] to develop a lecture transcription engine so that the lecture text can be easily accessed from
a web interface. The training corpus for the speech recognizer has 480k utterances
comprised of 6.8M words. It not only contains transcribed lectures, but also contains
data from the Switchboard corpus and the Michigan Corpus of Academic Spoken
English (MICASE). There are a total of 49.3k unique words in this training corpus,
which would be the vocabulary size of a word-only recognizer with 100% training corpus coverage. The test set has 7.4k utterances containing 72k words, selected from
lectures on computer algorithms, speech recognition systems, biology, and differential
equations. The entire training corpus has a 99.4% vocabulary coverage on the test
corpus.
The lecture data provides us with a practical application that has good opportunities for working with OOV words that are likely to be important. In these lectures,
the OOV words are often technical terms that are crucial for the understanding of
the lecture. Users are also likely to use these terms as keywords for searching through
the lecture transcripts. Therefore, the ability to hypothesize spellings for these OOV
words can significantly increase the value of the lecture-browsing application.
5.2 Methodology
A baseline word-only recognizer and a hybrid recognizer are built for several vocabulary levels, ranging from 88% to 99% coverage on the training corpus. Most of the
steps for building the recognizers are analogous to those for the lyrics recognizer described in Chapter 4. In-vocabulary words for each coverage level are selected by
decreasing frequency in the lecture training corpus. Table 5.1 shows the vocabulary
sizes of the hybrid recognizers, as well as the corresponding vocabulary coverage levels
on training and test sets. Finally, we also build a word-only recognizer using 100%
of the training vocabulary.
For building a word-only recognizer, pronunciations are needed for all words that
are considered to be in-vocabulary. Most of the 49.3k words in the training dictionary
are found in the SLS pronunciation dictionary, but 4.2k of those words are missing
pronunciations. For those words, a graphone letter-to-sound converter (L = 1, N = 8,
trained on the SLS dictionary) is used to automatically generate one pronunciation for
each word. For the hybrid recognizer, OOV words in the training corpus are replaced
by their graphone representations using a letter-to-graphone (L2G) converter, which
is trained on the SLS dictionary with L = 4 and N = 5. These graphones are also
added to the recognizer's lexicon. For both word-only and hybrid models, we use
the SRI Language Modeling Toolkit (SRILM) [36] to build their recognizer language
models. For this step, a trigram model is built on the training corpus using Witten-Bell smoothing and an N-gram pruning threshold of 0.000001.

Vocabulary Size    Training Coverage    Test Coverage
1320               88.0%                85.6%
1780               90.0%                88.4%
2468               92.0%                90.4%
3594               94.0%                93.0%
5661               96.0%                95.0%
10633              98.0%                97.2%
17374              99.0%                98.4%

Table 5.1: Vocabulary sizes for the hybrid recognizer, and corresponding vocabulary
coverage levels on the training and test sets.

The acoustic models for
this experiment are the same as those in the original lecture transcription recognizer,
which is trained from about 121 hours of speech from MIT lectures. This recognizer
also already has models for common non-speech utterances such as <um>, <laugh>,
<cough>, etc.
These recognizers are tested on new data from lectures not in the training set.
Recognition performance is measured by computing word error rate (WER), sentence error rate (SER), and letter error rate (LER) for each recognizer's output.
For LER, all spaces in the recognizer's output and the reference text are replaced
by a special token to take into account whether consecutive words are separated
correctly in the recognizer's output.
5.3 Results and Discussion
The hybrid model performs well when compared to word-only recognizers with the
same vocabulary coverage. The raw data for all graphs in this section is available in
Appendix C. Figure 5-1 shows WER for a) the word-only recognizer, b) replacing
each graphone sequence in the recognizer output with an OOV tag (Word & OOV),
and c) concatenating consecutive graphones in the output to spell out OOV words
(Word & Graphone). The trend here is similar to that depicted in Chapter 4
for lyrics recognition. The difference between word-only and replacing by an OOV
tag shows that graphones help us reduce mistakes near OOV words.

Word-only:  whatever they're using job that now i guess <uh> to do way lower
Hybrid:     whatever they're using [java] now i guess <uh> to do [eu|ler]
Reference:  whatever they're using java now i guess to do euler

Word-only:  these different kinds of like rocks old and hydrogen
Hybrid:     these different kinds of a [hydr|ox|al|s] and [hydr|ogen|s]
Reference:  these different kinds of hydroxyls and hydrogens

Word-only:  artistic models are made out of <uh> gauss in mixture switch
Hybrid:     are [acoustic] models are made up of <uh> [gaus|s|ian] makes [ture|s] which
Reference:  our acoustic models are made up of <uh> gaussian mixtures which

Word-only:  the right missiles are responsible for proteins and this is
Hybrid:     the [ribo|some|s] are responsible for protein [syn|thes|is]
Reference:  the ribosomes are responsible for protein synthesis

Word-only:  individual molecules and the scientist kelvin
Hybrid:     individual molecules and the [cyto|skel|eton]
Reference:  individual molecules and the cytoskeleton

Table 5.2: Example recognizer output for the hybrid and word-only recognizers at
96% training corpus coverage, along with reference transcriptions. Square brackets
indicate concatenation of graphones (only letter part shown). Vertical bars denote
graphone boundaries.

The difference
between replacing by an OOV tag and concatenating graphones shows that we can
spell some OOV words correctly using graphones. The performance benefit of using
graphones diminishes as we increase the recognizer's coverage. The more coverage
we use, the fewer OOV words we have to segment into graphones, causing the hybrid
language model to have less data about graphone-to-graphone transitions, as well as
word-to-graphone transitions. Table 5.2 shows some examples of the hybrid recognizer
correctly or almost-correctly spelling OOV words. This hybrid recognizer has 96%
training corpus coverage, equivalent to 95% test corpus coverage.
Also note that
sometimes the hybrid recognizer will recognize an in-vocabulary word as a sequence
of graphones, but usually with the correct spelling.
The comparison for SER is shown in Figure 5-2.
One trend is that replacing
by the OOV tag consistently does slightly worse than word-only recognition. This
could mean that some in-vocabulary words are consistently recognized as their
equivalent graphone form. Replacing by the OOV tag does not turn these words back
to their original spelling, so the sentences are counted as incorrect. However, after
the graphones are concatenated into words, almost all of these words are correctly
spelled back into their original form.

Figure 5-1: Recognition word error rate (WER) for 88-99% training vocabulary coverages. Word & OOV is replacing each output graphone sequence by an OOV tag. Word & Graphone is concatenating consecutive output graphones to spell OOV words.
By inspecting the hybrid recognizer's output, we notice that the hybrid recognizer
is very good at spelling words that are also in the training corpus, but are considered
as OOV based on the chosen vocabulary coverage. Since these graphone N-grams
exist in the training corpus, the recognition process is much easier. For OOV words
that do not occur in the training corpus, the task of spelling them with graphones is
much harder. In most cases, the recognizer only produces a partially-correct spelling
for these OOVs, which is counted as incorrect when calculating WER. To reward
partially-correct spellings, LER is calculated for word-only and hybrid recognizers, as
shown in Figure 5-3. The LER results show significant improvements for using the
hybrid model. The curve for the hybrid model is also very flat, which indicates that
even at lower vocabulary coverage levels, graphones are still effective in spelling many
words mostly correctly.
Figure 5-2: Recognition sentence error rate (SER) for 88-99% training vocabulary coverages. Word & OOV is replacing each output graphone sequence by an OOV tag. Word & Graphone is concatenating consecutive output graphones to spell OOV words.

Figure 5-3: Recognition letter error rate (LER) for 88-99% training vocabulary coverages. Word & Graphone is concatenating output graphone sequences to spell OOV words.

Finally, we also note that none of these hybrids outperform a word-only recognizer
with 100% training corpus vocabulary coverage, which includes 49.3k words with 4.2k
pronunciations automatically generated by graphone L2S. As shown in Table C.1, the
performance measures for this full-coverage word recognizer are: WER of 31.1%, SER
of 72.0%, and LER of 17.6%. The 99% hybrid has a WER of 31.8%, SER of 72.7%, and LER
of 17.8%. The 98% hybrid has a WER of 32.2%, SER of 73.0%, and LER of 18.0%. One
factor could be that although the training and test corpora are from different lectures,
the training corpus can still cover most of the words in the test corpus, so truly OOV
words are not abundant in the test corpus. In fact, only 204 of the 4k unique words in
the test set are not in the training corpus. Occurrences of these words account for 440
of the 72k total words in the test set. In this case, running with a large-vocabulary
recognizer can offer more constructive constraints in the search space. The results,
however, are still very close between the full-coverage word-only recognizer and the high-coverage hybrid recognizers.
Although the measured performance of hybrid recognizers is not better than the
full-coverage word-only recognizer, there are still important advantages to choosing
the hybrid approach. One advantage is vocabulary size. For the 98% hybrid recognizer, even with the graphones added, the recognizer's lexicon is still 20k entries smaller
than the full-coverage word recognizer's, which can offer computational resource advantages. The big difference in vocabulary size is not limited to the lectures domain
because data sparsity is an issue for many domains. The hybrid model also has the
advantage of hypothesizing spellings of OOV words, rather than recognizing them as
completely different in-vocabulary words. This is important if the OOV word is a
keyword for the lecture. In Table 5.3, we show some examples of output from the 100%
coverage word-only recognizer and the 98% coverage hybrid, highlighting cases where the hybrid model's
ability to hypothesize arbitrary words is useful. In addition, the hybrid model may
be more useful if the training corpus is more limited. In this task, we have the luxury
of training with a fairly large corpus of transcribed lectures, but such corpora are not available for many applications. With a smaller corpus, even if we get
pronunciations for all the words in the corpus, we still may not be able to account
for a significant number of words in the test data, in which case open-vocabulary
recognition may perform better.

Word-only:  linda was on great got their names attach to a version of this algorithm
Hybrid:     [lind|ouzo] and great got their names attach to a version of this algorithm
Reference:  linde buzo and gray got their names attached to a version of this <uh> algorithm

Word-only:  the outside it molecule
Hybrid:     the [leph|atic] molecule
Reference:  the allophatic molecule

Word-only:  this kind of reaction justification is the kind of linking
Hybrid:     this kind of reaction [ysti|fica|tion] is the kind of linking
Reference:  this kind of reaction esterification is the kind of linkage

Word-only:  in the cepstral domain we see that disputes
Hybrid:     in the [ceps|tral] domain we see that the speed [s]
Reference:  in the cepstral domain we see that the speech

Word-only:  the mitochondrial d n a from snort meander full bones
Hybrid:     the mitochondria [l] d n a from [son] art [nea|nder|thal] bones
Reference:  the mitochondrial d n a from <partial> <partial> neanderthal bones

Word-only:  here's the tube lists are also once again here's the phosphate
Hybrid:     here's the two [glyc|er|oles] once again here's the phosphate
Reference:  here's the two glycerols once again here's the phosphate

Table 5.3: Examples of recognition output for the 100% coverage word-only recognizer
and a 98% coverage hybrid recognizer, along with the corresponding reference transcriptions. Square brackets denote concatenated graphones. Vertical bars show graphone boundaries. Note that this table differs from Table 5.2 in that the hybrid and
word-only recognizers have different coverage levels.
5.4 Controlling the Number of Graphones
For the full L = 4 graphone model, 19k unique graphones are needed to segment all
words in the corpus into graphones. These 19k graphones are added to the speech
recognizer's lexicon, thus significantly increasing the size of the lexicon. In fact, for
the coverage levels explored in the lecture experiments (88% - 99%), the majority of
the lexicon is actually graphones. This leads to the question of whether we can trim
the graphone set and still preserve performance for open-vocabulary recognition. The
graphone set trimming process is described in Section 4.4.
For this experiment, 96% coverage hybrid recognizers are built with trimmed graphone sets of 4k and 2k. These recognizers are then evaluated on the test set. The
results of these recognition experiments are shown in Table 5.4. The speech recognition performance only degrades slightly when the graphone set is trimmed down from
19k to 4k. The difference between these two setups is less than 1 percentage point
for all three metrics. Therefore, the complexity of the recognizer can be significantly
reduced while maintaining a similar level of recognition performance.

        Full Graphone    4k Graphone    2k Graphone    Word-Only
WER     32.8%            33.6%          34.8%          36.8%
SER     73.5%            73.6%          74.5%          75.1%
LER     18.2%            18.6%          19.1%          20.7%

Table 5.4: Comparison of recognition performance for the full graphone hybrid model
and trimmed graphone models (4k and 2k graphone set size). Results are also shown
for the word-only recognizer. All of these recognizers have 96% vocabulary coverage
on the training corpus.
5.5 Hybrid Recognition with Full Vocabulary
Choosing a vocabulary coverage for the hybrid model involves a trade-off. For higher
vocabulary coverage, more words are in-vocabulary, but there are also fewer graphone-to-graphone transitions available in the training corpus for learning. This can reduce the
graphones' ability to spell OOV words. For lower vocabulary coverage, although there
are more graphone-graphone transitions available, a smaller vocabulary decreases
overall performance. The goal of this section is to build a recognizer that has 100%
vocabulary coverage of the training corpus, but also has hybrid model knowledge to
help it perform even better.
In order to do this, the recognizer must learn N-grams of all words in the training
corpus, as well as N-grams containing graphones. We can accomplish this by training
a hybrid recognizer on the concatenation of two copies of the language model corpus.
The first copy is the original corpus, and the second copy is a hybrid corpus at a
chosen vocabulary coverage. The first copy gives us full vocabulary coverage, and the
second copy incorporates word-graphone and graphone-graphone transitions into the
language model.
Figure 5-4: Test set perplexity of language models trained on a replicated training
corpus.
However, care must be taken when concatenating a language model corpus with
itself. Depending on the effects of smoothing, the language model estimated on the
same text replicated twice or more times can lead to a different, and perhaps better
model. For example, if a graphone model trained on two copies of the training corpus
(1 word-only, 1 hybrid) does better than the word-only recognizer, then we are still
not sure if the improvement is due to having some sentences appear twice, or from
the addition of graphones. In order to have a fair comparison with the proposed
graphone model, we must find a point where concatenating additional copies of the
language model corpus no longer increases the quality of the language model. To do
this, language models are trained on replicated word-only lecture corpora with varying
numbers of copies. Then, their perplexities on the lecture test set are calculated
using SRILM. Figure 5-4 shows the result of this perplexity calculation. Because the
perplexity curve flattens as we get to 10 copies of the training corpus, we use this
configuration to test the full-vocabulary graphone model. For practical applications,
these perplexity calculations can be done on a development set.
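The underlying smoothing effect is easy to reproduce: with Witten-Bell-style smoothing, replicating a corpus scales the N-gram counts but not the number of distinct continuations per history, so less probability mass goes to the backoff distribution. Below is a minimal bigram sketch with a toy corpus; it is an illustration of the phenomenon, not the SRILM models used in this chapter.

```python
import math
from collections import Counter

def wb_bigram(tokens):
    """Witten-Bell-smoothed bigram from a token list: p(w|h) interpolates the
    ML estimate with a uniform distribution, giving the backoff a weight of
    T(h)/(c(h)+T(h)), where T(h) is the number of distinct words observed
    after history h. Assumes every test-time history was seen in training."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist = Counter(tokens[:-1])
    vocab = set(tokens)
    distinct = Counter(h for (h, w) in bigrams)
    def p(w, h):
        lam = hist[h] / (hist[h] + distinct[h])
        return lam * bigrams[(h, w)] / hist[h] + (1 - lam) / len(vocab)
    return p

def perplexity(p, tokens):
    logprob = sum(math.log(p(w, h)) for h, w in zip(tokens, tokens[1:]))
    return math.exp(-logprob / (len(tokens) - 1))

# Replicating the toy corpus scales the counts but barely changes T(h),
# so the smoothing mass shrinks and the test perplexity shifts.
train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()
for copies in (1, 2, 10):
    print(copies, round(perplexity(wb_bigram(train * copies), test), 2))
```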
        Word-only 1x   Word-only 10x   96% Hybrid        98% Hybrid        99% Hybrid
                                       OOV      Gr       OOV      Gr       OOV      Gr
WER     31.1%          29.0%           29.3%    29.3%    29.2%    29.2%    29.1%    29.1%
SER     72.0%          69.7%           70.1%    70.1%    69.9%    69.9%    69.8%    69.8%
LER     17.6%          16.5%           -        16.5%    -        16.5%    -        16.5%

Table 5.5: Performance of hybrid recognizers with full vocabulary compared to the
full-coverage baseline word-only recognizers. Word-only 1x and 10x are built with one
and ten copies of the training corpus, respectively. The OOV column is for replacing
graphone sequences by OOV tags. The Gr column is for concatenating graphones
together to form words.
For this experiment, nine copies of the original corpora are concatenated with one
copy of the hybrid corpus at a given coverage level (96%, 98%, or 99%). The 4k graphone model (L = 4) as described in Section 5.4 is used for graphonizing OOV words
in the training corpus. This trimmed graphone set is used because these graphones
are added to an already-huge dictionary of 49.3k words. The hybrid recognizer built
on the resulting training corpus has 100% training vocabulary coverage, plus knowledge about graphones. The hybrid results are compared to a word-only recognizer
trained on 10 copies of the original corpora. Table 5.5 shows results of this experiment. For reference, this table also includes the results of the word-only recognizer
trained on one copy of the training corpus. These results show that as predicted from
perplexity calculations, replicating the training corpus does increase performance, but
this is likely to be a side-effect of smoothing. Comparing the 10x word-only language
model and corresponding hybrid recognizers (lx hybrid + 9x word-only corpus), we
do not see any gains with the hybrid recognizers. We also do not see much decrease
in recognition either, especially in LER. Comparing the columns for replacing by an
OOV tag and for concatenating graphones, we see that at such a high coverage level,
the graphones rarely spell OOV words correctly. Because the training set's coverage
of test set words is very high for this task, it is also good too see that adding graphones to a full-coverage recognizer does not significantly degrade recognition. If this
full-vocabulary hybrid were built for a different task where the training corpus has
a much lower coverage of the test corpus, then the full-vocabulary hybrid has much
greater potential for outperforming the full-coverage word-only recognizer.
5.6 Hierarchical Alternative for Hybrid Model
So far, when we mention hybrid models, we have been referring to flat hybrid models.
However, there are alternatives to the flat approach for building a hybrid model. In
this section, we experiment with hierarchical hybrid models, or simply hierarchical
models for short. These models are hierarchical because they model the language
corpus at two levels: the word level and the subword level. At the word level, OOV
words are replaced by OOV tokens during language model training. The subword-level model is trained on an arbitrary pronunciation dictionary, and serves as the
OOV model. During recognition, the top-level recognizer decides whether an OOV
word has occurred, and if so, the recognizer uses the subword OOV model to find the
most-likely OOV word (see Figure 5-5). A cost factor, Coov, is applied whenever the
recognizer uses the OOV model, which can be seen as a normalizing factor between
the OOV model and the word-level model. This cost is additive in the log-probability
domain, so it is multiplicative in the probability domain. Bazzi and Glass have used
phonemes for the OOV model in the hierarchical framework to generate phonetic
sequences for OOV words [4]. We build on previous work by using graphones in the
OOV model to spell OOV words.
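The way the cost enters a hypothesis score can be written out in a short sketch (all numbers are invented):

```python
C_OOV = -7.5  # cost per use of the OOV model, tuned on a development set;
              # added in the negative-log-probability (FST score) domain

def hyp_cost(word_lm_costs, oov_spell_costs):
    """Negative log probability of a hypothesis under the hierarchical model:
    the word-level LM scores the sentence with <OOV> tokens in place, and each
    OOV additionally pays its subword (graphone) spelling cost plus C_OOV.
    All numbers here are invented for illustration."""
    return sum(word_lm_costs) + sum(c + C_OOV for c in oov_spell_costs)

# "the <OOV> is": three word-LM token costs, one OOV spelled by the subword model
print(hyp_cost([2.1, 6.0, 1.4], [14.8]))  # -> 16.8
```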
There are pros and cons of using a hierarchical model versus a flat model to detect
and hypothesize OOVs. The flat model knows about transitions from words to the
beginning graphone of an OOV, as well as the transition from the ending graphone
of an OOV to the next word. When there are consecutive OOVs in the training
corpus, the flat model also learns transitions between graphones at the boundary of
the OOVs. The hierarchical model may have better generalizability for spelling OOV
words because its OOV model can be trained on a large dictionary that is independent
of the language model training corpus for the recognizer.
The flat hybrid, on the other hand, only learns how to spell OOV words from the segmented words in the
language model training corpus. At high vocabulary coverages, this is usually a much
smaller data set than a large training dictionary. The hierarchical model can capture
probabilities that span words on both sides of the OOV token, but the flat model
cannot, unless the OOV word is very short. The flat model doesn't require parameter
tuning, while the hierarchical model requires tuning of Coov on a development set.
However, tuning this parameter also gives us more control of how often to go into the
OOV model, which can help us prevent in-vocabulary words from being recognized
as OOVs.

Figure 5-5: Hierarchical OOV model framework (from [2]). Woov represents the OOV token.
To build the hierarchical model for this experiment, a graphone model is trained
on the SLS pronunciation dictionary with parameters L = 3, N = 3, resulting in
a set of 19.6k graphones. This model is then converted to an FST using the MIT
Finite-State Transducer Toolkit [26]. The Coov factor is added by inserting a node at
the beginning of this FST. For the word-level model, the language model is estimated
using SRILM at 98% training set vocabulary coverage, replacing OOV words by an
OOV token. This language model is then converted to an FST where the OOV token
is converted to a dynamic class that points to the OOV model's FST. A 400-utterance
development set is used to tune Coov by optimizing WER. The best Coov found
for this model is -7.5 (FST scores are negative log probabilities). This negative cost
is needed because a hypothesized word generated by the OOV model has a much
higher score than a normal word. The word-transition-weight for the hierarchical
recognizer is also adjusted slightly to compensate for the addition of the OOV model.
Performing a backward pass on the top-level model is tricky because we need to skip
over all the OOV tokens. For this proof-of-concept experiment, we run a forward
trigram and widen the search beam by a factor of 5 to compensate for the much
larger search space. We plan to implement the bigram forward pass and a trigram
backward pass for this model in the near future.
For evaluation, the hierarchical and flat hybrids, along with a word-only recognizer,
are run on a 7k-utterance test set. The flat hybrid uses an L = 3, N = 5 L2G converter for graphonizing
OOV words in the training set. All of these recognizers have a vocabulary coverage
of 98% on the training set. The recognition results are shown in Table 5.6. These
results show that the hierarchical and flat hybrid models have similar performance,
and both significantly outperform the word-only recognizer at this coverage level.
Several examples of recognition output for the three types of recognizers are shown
in Table 5.7. These examples show that although the flat hybrid model and the top-level N-gram of the hierarchical model are trained on corpora with the same OOV
locations, the recognition output can still vary in where OOV words are detected.
The hypothesized spellings are often different for the two hybrid models because the
training data for the graphone part of these models are different.
Although the LER of the hierarchical model is slightly worse than the flat hybrid
recognizer's, a closer examination of the recognizers' output actually reveals that the
hierarchical model generalizes better to truly unseen data. There are 204 unique
words in the test set that occur nowhere in the training corpus. Out of those unseen
words, the flat hybrid spells 10 of those words correctly (including placing the word
boundaries correctly) at least once in the test set. The hierarchical hybrid however,
spells 41 of those truly unseen words correctly at least once in the test set. This
shows that if the OOV position is hypothesized correctly, then the hierarchical hybrid
is more capable of spelling the OOV word. This is expected because the hierarchical
OOV model is trained on a 150k-word dictionary rather than only on the words that
get graphonized at the 98% coverage level on the training corpus. In addition, EM
with smoothing is run for all N-gram sizes when building the graphone model for the
hierarchical hybrid, which is not the case for the graphone N-grams in the flat hybrid.
This confirms one strength of hierarchical OOV models over flat hybrid models.

           Word-only   Flat     Hierarchical
    WER    33.6%       32.5%    32.5%
    SER    73.2%       72.8%    72.8%
    LER    19.0%       18.2%    18.3%

Table 5.6: Comparison of hybrid approaches for the lecture transcription task at 98%
vocabulary coverage.
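The unseen-word analysis above can be reproduced with a simple count. Below is a
minimal sketch, assuming per-utterance token lists for references and hypotheses;
exact token matching stands in for the word-boundary check described above.

    # Count how many test-set words absent from the training corpus are
    # spelled correctly (as standalone tokens) at least once by a recognizer.

    def count_correct_unseen(references, hypotheses, training_vocab):
        unseen = {w for ref in references for w in ref
                  if w not in training_vocab}
        spelled = set()
        for ref, hyp in zip(references, hypotheses):
            hyp_tokens = set(hyp)
            spelled |= {w for w in ref if w in unseen and w in hyp_tokens}
        return len(unseen), len(spelled)

Under this style of counting, the results above correspond to 10 of 204 unseen words
for the flat hybrid and 41 of 204 for the hierarchical hybrid.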
5.7 Summary
In this chapter, we built a hybrid recognizer for the lectures domain and demonstrated
significant improvements over the word-only baseline recognizer. We also showed that
trimming the set down to 4k graphones does not significantly impact performance. A
full-coverage hybrid recognizer was also built and shown to produce fewer errors than
the full-coverage word-only recognizer. Finally, we presented a hierarchical graphone
model that yields performance similar to that of the flat hybrid graphone model.
Example 1
  Reference:    <uh> represents carbohydrates
  Hierarchical: <uh> represents [cor|bel|hyd|ra|tes]
  Flat:         <uh> represents [car|bo|hyd|ra|tes]
  Word-only:    <uh> represents car behind rates

Example 2
  Reference:    somebody tell me asymptotically
  Hierarchical: summary tell me [asy|mpt|ot|ica|lly]
  Flat:         summary tell me [asy|mpt|ot|ica|lly]
  Word-only:    summary tell me hasn't ironically

Example 3
  Reference:    here's the golgi apparatus and the golgi <uh> apparatus up here
  Hierarchical: here's the [gol|gi|ap|br|ede|son] the gold yet [app|ara|tus] up here
  Flat:         here's the goal g [app|ara|tus] in the gold yet [app|ara|tus] up here
  Word-only:    here's the goal of yep bread so the gold yet <uh> bread is up here

Example 4
  Reference:    this side of the triangle this the hypotenuse has slope an
  Hierarchical: this so late of the triangle this the [hyp|otenuse] have flow a and
  Flat:         this so late of the triangle this the hype understand flow a and
  Word-only:    this so late of the triangle this the hype understand flow a and

Example 5
  Reference:    and the truth is that vast numbers of biochemical linkages are made by
                esterification reactions and reversed by reactions that are called
                simply hydrolysis
  Hierarchical: and the truth things that vast numbers of [bio|che|mic|al|ly] catches
                are made by [esp|iri|fic|ati|on] reactions and reversed by reactions
                that are called simply [hyd|roly|sis]
  Flat:         and the truth things that vast numbers of biochemical [ink|a|ges] are
                made by us [ver|ifi|cat|ion] reactions and reversed by reactions that
                are called simply [hyd|roisis]
  Word-only:    and the truth things that vast numbers of biochemical the teachers are
                made by a spirit vacation reactions and reversed by reactions that are
                called simply high gross is

Table 5.7: Examples of recognition output from the hierarchical, flat, and word-only
models, shown against the reference transcription. All recognizers have 98% training
corpus coverage. Square brackets denote concatenated graphones (only letter parts
shown); vertical bars show boundaries between graphones.
Chapter 6
Summary and Future Directions
In this thesis, we explored using graphones for letter-to-sound (L2S) as well as hybrid
recognition tasks, in several applications with practical use cases. We discussed L2S
conversion in the context of generating pronunciations for restaurant and street names
for a mobile restaurant guide application, and produced results slightly better than
similar subword approaches. We then presented results of using graphone flat hybrid
recognition to recognize lyrics queries issued by users with the goal of finding songs.
Finally, we applied graphones to transcribing spoken lectures in an open vocabulary
setting. Overall, we conclude that the flat hybrid model is effective when compared to
word-only models with the same vocabulary coverage. We have also seen that trimming
the graphone set does not significantly degrade recognition performance, especially for
the open vocabulary task. Finally, we explored some improvements to the standard flat
hybrid model by incorporating all the words in the vocabulary, as well as building a
hierarchical hybrid instead of a flat hybrid.
6.1 Future Directions
One future area of focus is improving the flat hybrid model. We can add a word
boundary token to graphonized OOVs in the language model training corpus so that
we can easily separate consecutive OOV words in the recognition output (see the
sketch after this paragraph). This can be especially useful if the OOV rate is high.
Also, the graphone size parameter has trade-offs for hybrid recognition performance:
large graphones are good for recognition, but are not as good as smaller graphones
for graphonizing words accurately. This trade-off may be mitigated by using a long
N-gram L2S model to get pronunciations for OOV words and then jointly segmenting
the OOV words into graphones [39]. Although Vertanen does not find improvements
from this technique on the WSJ task, its performance might be domain dependent.
For future work, we can explore this technique for the lecture and lyrics domains.
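Returning to the word-boundary idea, the following is a minimal sketch of how an LM
training corpus might be graphonized with an explicit boundary token between
consecutive OOV words. The token name <wb> and the graphonize helper are
illustrative assumptions, not part of the thesis's actual setup.

    # Expand OOV words into graphone tokens for LM training, inserting a
    # boundary token between back-to-back OOV words so that they can be
    # separated again in the recognition output.

    WB = "<wb>"

    def graphonize_sentence(words, vocab, graphonize):
        out, prev_was_oov = [], False
        for word in words:
            if word in vocab:
                out.append(word)
                prev_was_oov = False
            else:
                if prev_was_oov:
                    out.append(WB)              # separate consecutive OOVs
                out.extend(graphonize(word))    # hypothetical L2G conversion
                prev_was_oov = True
        return out

With this convention, two adjacent OOV words come out as two graphone runs
separated by <wb>, so each spelling can be recovered independently.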
We also presented a hierarchical graphone model in this work, but there is much
more to be studied in this area. We can dynamically adapt OOV models for the
hierarchical hybrid recognizer to improve its ability to hypothesize OOV spellings.
The hierarchical model can also be combined with a flat hybrid approach through a
two-stage configuration. In this system, the first stage uses a flat hybrid model to
identify OOV regions. The second stage rescores these regions with a more powerful
graphone-only OOV model similar to the one used for the hierarchical hybrid.
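A minimal sketch of this two-stage data flow, with hypothetical interfaces for both
stages (decode, oov_regions, and slice are assumed names):

    # The flat hybrid locates OOV regions; a stronger graphone-only model
    # then re-decodes just those audio spans to hypothesize spellings.

    def two_stage_decode(utterance, flat_hybrid, graphone_oov_model):
        hypothesis = flat_hybrid.decode(utterance)
        for region in hypothesis.oov_regions():   # time spans flagged OOV
            segment = utterance.slice(region.start, region.end)
            region.spelling = graphone_oov_model.decode(segment)
        return hypothesis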
Although building a hybrid model allows us to detect OOVs, we also lose training
corpus words from the vocabulary when we decide on a vocabulary coverage for the
hybrid model (applies to both flat and hierarchical models). The words excluded from
the hybrid model could have been added to the speech recognizer's vocabulary after
L2S conversion. The best solution should have the best of both worlds. It should
allow the recognizer to use pronunciations of all words in the training corpus, plus
information offered by a hybrid model. In this work, we explored a preliminary approach, which is to simply concatenate the word-only corpus with a hybrid corpus for
LM training. However, more advanced techniques can be used. One approach could
be to merge a large-vocabulary word-only model and a flat-hybrid model through
language modeling techniques. Another approach could consider every word in the
training corpus to have some partial probability of being considered OOV, so
that we can effectively learn word-to-word and word-to-graphone transitions for all
words in the training corpus.
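One standard way to realize the model-merging idea is linear interpolation of the two
language models, with the mixture weight tuned on held-out data. In our notation
(not the thesis's):

\[
P(w \mid h) = \lambda\, P_{\mathrm{word}}(w \mid h) + (1-\lambda)\, P_{\mathrm{hybrid}}(w \mid h), \qquad 0 \le \lambda \le 1,
\]

where h is the N-gram history, P_word is the large-vocabulary word-only model, and
P_hybrid is the flat hybrid model.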
When the graphones are trained by maximum-likelihood training, we have seen
that the L2S models with larger graphones do not perform as well as the ones with tiny
graphones. Ideally, the model with bigger graphones should contain the model with
the smaller graphones. If the best model were the model with the smaller graphones,
then the ideal training process should always produce the smaller-graphone model,
regardless of the specified graphone size upper limit. In the current training process,
the smaller graphones are able to take significant advantage of language modeling
techniques, but bigger graphones don't get as much benefit because each word does
not contain as many big graphones. One possibility is to introduce some form of
backoff and smoothing along the graphone size dimension in a manner analogous
to the existing backoff and smoothing along the N-gram size dimension. Another
possibility is to use a prior based on linguistic knowledge and turn the maximum-likelihood
training into a maximum a posteriori (MAP) training process. This would be helpful
for languages where we have some linguistic knowledge about syllable size or structure.
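In symbols (ours, not the thesis's), with D the training dictionary and \theta the
graphone inventory and segmentation model, this replaces the maximum-likelihood
criterion with a maximum a posteriori one:

\[
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} P(D \mid \theta)
\qquad\longrightarrow\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(D \mid \theta)\, P(\theta),
\]

where the prior P(\theta) could, for example, favor graphones whose letter and
phoneme lengths match typical syllable sizes for the language.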
Appendix A
Data Tables: Restaurant and
Street Name Recognition
          Top 1 Pronunciation        Top 2 Pronunciations
L   N     All    Seen   Unseen       All    Seen   Unseen
1   1     65.0   65.5   64.8         65.5   65.8   65.4
1   2     47.2   47.3   47.1         42.9   43.3   42.8
1   3     38.9   41.2   37.8         35.2   36.6   34.6
1   4     38.2   38.7   38.0         33.1   31.8   33.7
1   5     36.0   36.6   35.7         32.5   31.0   33.2
1   6     35.2   35.8   34.9         31.6   30.0   32.4
1   7     35.6   35.5   35.7         32.0   30.7   32.6
1   8     35.6   35.5   35.7         32.0   30.8   32.6
2   1     58.3   60.2   57.5         55.0   55.9   54.5
2   2     38.8   39.8   38.4         33.5   33.5   33.5
2   3     37.4   37.2   37.6         33.1   31.0   34.1
2   4     37.0   36.6   37.3         33.1   32.4   33.5
2   5     36.8   36.4   37.0         32.9   32.4   33.2
2   6     36.8   36.4   37.0         32.9   32.4   33.2
3   1     51.7   51.8   51.7         45.5   45.7   45.4
3   2     37.5   36.6   37.9         34.0   34.0   34.0
3   3     36.6   35.6   37.0         33.8   32.9   34.3
3   4     36.6   35.6   37.0         33.8   32.9   34.3
3   5     36.6   35.6   37.0         33.8   32.9   34.3
4   1     44.4   43.0   45.1         40.2   39.6   40.5
4   2     37.8   36.4   38.4         34.5   32.9   35.2
4   3     37.6   36.1   38.2         34.4   32.7   35.1
4   4     37.5   36.1   38.1         34.4   32.7   35.1
4   5     37.5   36.1   38.1         34.4   32.7   35.1

Table A.1: Word recognition performance using automatically generated pronunciations
for restaurant and street names. Graphone L2S converters with various L and N values
are used to generate pronunciations. Results are shown in WER (%). This data is used
to generate Figures 3-1, 3-2, 3-4, 3-5, 3-6, and 3-7.
Num. Pron.   Graphone   Spellneme
1            35.2       37.1
2            31.6       32.1
3            30.6       30.6
5            30.3       30.2
10           30.0       29.7
20           30.3       30.4
50           32.3       31.1

Table A.2: Word recognition performance with various numbers of pronunciations per
word. The graphone converter is trained with L = 1 and N = 6. The spellneme data is
from [13]. Results are shown in WER (%). This data corresponds to Figure 3-3.
Appendix B
Data Tables: Lyrics Hybrid
Recognition
Ctrain   Ctest   |V|      Word-only            Word & OOV     Word & Graphone
                          WER   SER   LER      WER   SER      WER   SER   LER
50.0     53.0    68       88.1  99.3  55.5     63.4  99.4     48.6  89.5  23.9
60.0     63.5    120      79.4  97.5  51.2     55.9  98.0     41.8  84.3  21.9
70.0     74.3    233      65.3  93.7  41.9     47.3  94.3     35.8  77.9  20.1
80.0     84.5    492      50.1  86.0  32.6     39.1  87.3     31.6  72.7  18.8
90.0     93.4    1443     36.9  75.7  23.9     31.7  76.2     28.5  67.9  17.6
92.0     94.9    1942     34.8  73.3  22.3     30.5  74.3     28.0  67.4  17.4
94.0     96.4    2766     32.5  71.1  21.1     29.2  70.9     27.5  66.8  17.2
96.0     97.7    4386     30.5  68.9  19.9     28.0  68.9     27.1  66.2  17.0
98.0     99.0    8572     29.2  67.5  19.0     27.3  66.4     27.0  65.4  16.9
99.0     99.4    14532    28.6  67.5  18.6     26.6  65.8     26.4  65.5  16.6
100.0    99.8    46937    28.4  67.5  18.5     -     -        -     -     -

Table B.1: Performance of the flat hybrid recognizer on music lyrics. Ctrain and Ctest
are vocabulary coverages (%) on the training and test sets, respectively. |V| is the
vocabulary size. Word & OOV replaces each graphone sequence in the recognizer's
output by an OOV tag; Word & Graphone concatenates graphones to spell OOV words.
All error rates are in percentages. This data corresponds to Figures 4-1, 4-2, and 4-3.
Appendix C
Data Tables: Spoken Lecture
Transcription
Ctrain   Ctest   |V|      Word-only            Word & OOV     Word & Graphone
                          WER   SER   LER      WER   SER      WER   SER   LER
88.0     85.6    1320     52.5  80.3  29.9     41.9  80.7     36.2  75.5  19.3
90.0     88.4    1780     47.5  78.6  26.9     39.8  79.1     35.0  74.5  18.9
92.0     90.4    2468     43.9  77.7  24.7     38.2  78.2     34.3  74.0  18.7
94.0     93.0    3594     39.6  76.3  22.1     36.3  76.7     33.6  73.8  18.4
96.0     95.0    5661     36.8  75.1  20.7     34.7  75.5     32.8  73.5  18.2
98.0     97.2    10634    36.6  73.4  18.9     33.0  73.8     32.2  73.0  18.0
99.0     98.4    17374    32.1  72.5  18.1     32.1  72.8     31.8  72.7  17.8
100.0    99.4    49333    31.1  72.0  17.6     -     -        -     -     -

Table C.1: Performance of the flat hybrid recognizer on transcribing spoken lectures.
Ctrain and Ctest are vocabulary coverages (%) on the training and test sets, respectively.
|V| is the vocabulary size. Word & OOV replaces each graphone sequence in the
recognizer's output by an OOV tag; Word & Graphone concatenates graphones to spell
OOV words. All error rates are in percentages. This data corresponds to Figures 5-1,
5-2, and 5-3.
Bibliography
[1] M. Akbacak, D. Vergyri, and A. Stolcke. Open-vocabulary spoken term detection
using graphone-based hybrid recognition systems. In Proc. Int. Conf. on Spoken
Language Processing, pages 5240-5243, Las Vegas, NV, USA, 2008.
[2] I. Bazzi. Modelling out-of-vocabulary words for robust speech recognition. Ph.D.
Thesis. MIT Department of Electrical Engineering and Computer Science, 2002.
[3] I. Bazzi and J. Glass. Heterogeneous lexical units for automatic speech recognition: Preliminary investigations. In Proc. Int. Conf. on Acoustics, Speech, and
Signal Processing, Istanbul, Turkey, 2000.
[4] I. Bazzi and J. Glass. Modeling out-of-vocabulary words for robust speech recognition. In Proc. Int. Conf. on Spoken Language Processing, Beijing, China, 2000.
[5] I. Bazzi and J. Glass. Learning units for domain-independent out-of-vocabulary
word modelling. In Proc. European Conf. on Speech Communication and Technology, Aalborg, Denmark, 2001.
[6] M. Bisani and H. Ney. Investigations on joint-multigram models for grapheme-to-phoneme conversion. In Proc. Int. Conf. on Spoken Language Processing, pages
105-108, Denver, CO, USA, 2002.
[7] M. Bisani and H. Ney. Multigram-based grapheme-to-phoneme conversion for
LVCSR. In Proc. European Conf. on Speech Communication and Technology, pages
933-936, Geneva, Switzerland, 2003.
[8] M. Bisani and H. Ney. Open vocabulary speech recognition with flat hybrid
models. In Proc. European Conf. on Speech Communication and Technology,
pages 725-728, Lisbon, Portugal, 2005.
[9] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434-451, 2008.
[10] L. Burget, P. Schwarz, P. Matějka, M. Hannemann, A. Rastrow, C. White,
S. Khudanpur, H. Hermansky, and J. Černocký. Combination of strongly and
weakly constrained recognizers for reliable detection of OOVs. In Proc. Int. Conf.
on Acoustics, Speech, and Signal Processing, pages 4081-4084, Las Vegas, NV,
USA, 2008.
[11] S. F. Chen. Conditional and joint models for grapheme-to-phoneme conversion.
In Proc. European Conf. on Speech Communication and Technology, pages 2033-2036, Geneva, Switzerland, 2003.
[12] G. Choueiter. Linguistically-Motivated Sub-word Modeling with Applications to
Speech Recognition. Ph.D. Thesis. MIT Department of Electrical Engineering
and Computer Science, 2009.
[13] G. Choueiter, S. Seneff, and J. Glass. Automatic lexical pronunciations generation and update. In Proc. Workshop on Automatic Speech Recognition and
Understanding, pages 225-230, Kyoto, Japan, 2007.
[14] G. Chung. A three-stage solution for flexible vocabulary speech understanding.
In Proc. Int. Conf. on Spoken Language Processing, Beijing, China, 2000.
[15] W. Daelemans and A. van den Bosch. Language-independent data-oriented
grapheme-to-phoneme conversion. In Progress in Speech Processing, pages 77-89,
1997.
[16] S. Deligne and F. Bimbot. Inference of variable-length linguistic and acoustic
units by multigrams. Speech Communication, 23:223-247, 1997.
[17] S. Deligne, F. Yvon, and F. Bimbot. Variable-length sequence matching for
phonetic transcription using joint multigrams. In Proc. European Conf. on Speech
Communication and Technology, pages 2243-2246, Madrid, Spain, 1995.
[18] H. Elovitz, R. Johnson, A. McHugh, and J. Shore. Letter-to-sound rules for
automatic translation of English text to phonetics. IEEE Trans. on Acoustics,
Speech and Signal Processing, 24:446-459, 1976.
[19] L. Galescu. Recognition of out-of-vocabulary words with sub-lexical language
models. In Proc. European Conf. on Speech Communication and Technology, pages
249-252, Geneva, Switzerland, 2003.
[20] M. Gerosa and M. Federico. Coping with out-of-vocabulary words: open versus huge vocabulary ASR. In Proc. Int. Conf. on Acoustics, Speech, and Signal
Processing, pages 4313-4316, Taipei, Taiwan, 2009.
[21] J. Glass. A probabilistic framework for segment-based speech recognition. Computer Speech and Language, 17(2-3):137-152, 2003.
[22] J. Glass, T. J. Hazen, S. Cyphers, I. Malioutov, D. Huynh, and R. Barzilay. Recent progress in the MIT spoken lecture processing project. In Proc. Interspeech,
pages 2553-2556, Antwerp, Belgium, 2007.
[23] T. Hazen and I. Bazzi. A comparison and combination of methods for OOV
word detection and word confidence scoring. In Proc. Int. Conf. on Acoustics,
Speech, and Signal Processing, Salt Lake City, UT, USA, 2001.
[24] T. Hazen, T. Burianek, J. Polifroni, and S. Seneff. Recognition confidence scoring
for use in speech understanding systems. In Computer Speech and Language,
pages 49-67, 2000.
[25] I. Hetherington. A characterization of the problem of new, out-of-vocabulary
words in continuous-speech recognition and understanding. Ph.D. Thesis. MIT
Department of Electrical Engineering and Computer Science, 1995.
[26] L. Hetherington. The MIT finite-state transducer toolkit for speech and language
processing. In Proc. Int. Conf. on Spoken Language Processing, Jeju, South Korea, 2004.
[27] S. Jiampojamarn, C. Cherry, and G. Kondrak. Joint processing and discriminative training for letter-to-phoneme conversion. In Proc. Workshop on Human
Language Technology, pages 905-913, Columbus, OH, USA, 2008.
[28] T. Kemp and T. Schaaf. Estimating confidence using word lattices. In Proc. European Conf. on Speech Communication and Technology, pages 827-830, Rhodes,
Greece, 1997.
[29] H. Lin, J. Bilmes, D. Vergyri, and K. Kirchhoff. OOV detection by joint
word/phone lattice alignment. In Proc. Workshop on Automatic Speech Recognition and Understanding, pages 478-483, 2007.
[30] C. Martins, A. Teixeira, and J. Neto. Vocabulary selection for a broadcast news
transcription system using a morpho-syntactic approach. In Proc. Int. Conf. on
Spoken Language Processing, Antwerp, Belgium, 2007.
[31] H. M. Meng, S. Seneff, and V. W. Zue. Phonological parsing for bi-directional
letter-to-sound/sound-to-letter generation. In Proc. Workshop on Human Language Technology, pages 289-294, Morristown, NJ, USA, 1994. Association for
Computational Linguistics.
[32] S. Oger, G. Linarès, F. Béchet, and P. Nocera. On-demand new word learning
using world wide web. In Proc. Int. Conf. on Acoustics, Speech, and Signal
Processing, pages 4305-4308, Las Vegas, NV, USA, 2008.
[33] A. Rastrow, A. Sethy, and B. Ramabhadran. A new method for OOV detection
using hybrid word/fragment system. In Proc. Int. Conf. on Acoustics, Speech,
and Signal Processing, pages 3953-3956, Taipei, Taiwan, 2009.
[34] S. Seneff. The use of subword linguistic modeling for multiple tasks in speech
recognition. Speech Communication, 42(3-4):373-390, 2004.
[35] S. Seneff. Reversible sound-to-letter/letter-to-sound modeling based on syllabic
structure. In Proc. Workshop on Human Language Technology, pages 153-156,
Rochester, NY, 2007.
[36] A. Stolcke. SRILM - An extensible language modeling toolkit. In Proc. Int. Conf.
on Spoken Language Processing, Denver, CO, USA, 2002.
[37] P. Taylor. Hidden Markov models for grapheme to phoneme conversion. In Proc.
European Conf. on Speech Communication and Technology, Lisbon, Portugal,
2005.
[38] K. Torkkola. An efficient way to learn English grapheme-to-phoneme rules automatically. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, pages
199-202, Martigny, Switzerland, 1993.
[39] K. Vertanen. Combining open vocabulary recognition and word confusion networks. In Proc. Int. Conf. on Spoken Language Processing, pages 4325-4328, Las
Vegas, NV, USA, 2008.
[40] P. Vozila, J. Adams, Y. Lobacheva, and R. Thomas. Grapheme to phoneme conversion and dictionary verification using graphonemes. In Proc. European Conf.
on Speech Communication and Technology, pages 2469-2472, Geneva, Switzerland, 2003.
[41] F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large
vocabulary continuous speech recognition. IEEE Transactions on Speech and
Audio Processing, 9(3):288-298, 2001.
[42] C. White, G. Zweig, L. Burget, P. Schwarz, and H. Hermansky. Confidence
estimation, OOV detection and language ID using phone-to-word transduction
and phone-level alignment. In Proc. Int. Conf. on Acoustics, Speech, and Signal
Processing, pages 4085-4088, Las Vegas, NV, USA, 2008.
[43] V. Zue, S. Seneff, J. Glass, J. Polifroni, C. Pao, T. Hazen, and L. Hetherington.
JUPITER: A telephone-based conversational interface for weather information.
IEEE Trans. on Speech and Audio Processing, 8(1):100-112, 2000.