Learning to Recognize Speech by Watching Television
Photina Jaeyun Jang and Alexander G. Hauptmann, Carnegie Mellon University
Because intelligent systems should be able to automatically learn by watching examples, the
authors present a way to use television broadcasts with closed-captioned text as a source for large
amounts of transcribed speech-training data.
Imagine a computer plugged into your television at home or in a foreign hotel. In the morning, it
can barely transcribe one out of two words correctly, but by evening, it can provide a largely
correct transcript of the evening news show, based on what it learned during the day from the TV.
This seems like the core of familiar science fiction shows, but the research described here brings
this vision closer to reality by exploiting the multimodal nature of the broadcast signals and the
variety of sources available for broadcast data.
Our proposed technique gathers large amounts of speech from open broadcast sources and combines it with automatically obtained text or closed captioning to identify suitable speech-training material. George Zavaliagkos and Thomas Colthurst pursued a different approach, which uses confidence scoring on the acoustic data itself to improve performance in the absence of any transcribed data, but their approach yielded only marginal results.1 Our initial efforts also provided only limited success with small amounts of data.2
In this article, we describe our approach to collecting almost unlimited amounts of accurately
transcribed speech data. This information serves as training data for the acoustic-model
component of most high-accuracy speaker-independent speech-recognition systems. We align the error-ridden closed-captioned text with the similarly error-ridden speech-recognizer output. We
assume matching segments of sufficient length are reliable transcriptions of the corresponding
speech. We then use these segments as the training data for an improved speech recognizer.
Problems with speech recognition
A speech-recognition system generally uses three components:3 an acoustic model, a language
model, and a lexicon of pronunciations. Here, we only discuss using broadcast news data from
TV sources to improve acoustic models by combining them with imperfect closed captions.
Although the closed captions can also be used to build a language model, this process is
straightforward and will not be discussed here.4
Artificial neural networks, hidden Markov models (HMM), or hybrid systems of both are the
dominant acoustic models of speech-recognition systems. Researchers have extensively tuned the
architecture of these acoustic models, so achieving large recognition improvements is difficult.
Ideally, a good acoustic model would be sufficient for a complete speech-recognition system. But techniques outside the acoustic domain, such as language modeling, confidence annotation, robust signal processing, and postprocessing, have been necessary to aid acoustic models and play an increasingly important role in better speech recognition. Because the architecture of these acoustic models is already fairly well optimized, yet the models can still improve through training on large amounts of speech data, the emphasis has shifted to collecting manually transcribed speech data.
This transcribed data provides a training corpus for the acoustic model. With sufficient coverage
in the training corpus, which ideally includes typical instances of all representative speech
patterns that might be encountered during recognition, the acoustic model should cope with any
general speech patterns encountered in the future. Thus, it is important to have as much
representative transcribed speech-training data as possible to train the acoustic models, to
maximize the speech-recognition system’s performance.5
Besides extensive speech-data collection, we face the bottleneck of manually annotating or
transcribing the collected speech corpora. Supervised learning systems such as neural networks or
HMM systems require a highly accurate transcription of the training data. Speech-data
transcription relies heavily on humans listening to the speech and manually writing down what
was said, one sentence at a time. This manual transcription is very tedious, expensive, and still
subject to human error. The largest currently available collection of accurately transcribed
broadcast data consists of slightly more than 240 hours of broadcast news provided by the
Linguistic Data Consortium (http://ldc.upenn.edu), but this data is still not error-free.
We present a method for solving both the problems of expensive speech-data collection and
expensive human annotation of speech. Every day new speech data becomes available on
television, along with human-transcribed closed captions. The challenge is to use this unreliable
data for speech-recognizer training.
Sources of speech and preliminary transcriptions
The Informedia Digital Video Library records about an hour and a half of TV broadcasts every
day and stores them in MPEG format (see the “Informedia Digital Video Library” sidebar for
more information). If closed captioning is available, the closed-captioned data is stored in a
separate text file. To date, the library has over 1,700 hours of video. Although we could access
broadcast TV directly 24 hours a day to extract our data, we restricted ourselves to a portion of
the video archive in the Informedia Digital Video Library for simplicity and repeatability. For the
experiments reported here, we used 709 hours of initial raw data from the library.
For comparison, we also examined 100 hours of manually transcribed training data provided by
the Linguistic Data Consortium, which we used as a baseline contrast as well as seed data in our
experiments. After eliminating silences and untranscribed sections, we reduced the LDC Hub4
data to 66 hours of transcribed speech.
Extracting audio samples
From each stored MPEG file, we extract the MPEG audio stream, uncompress it into its
original 44.1-kHz, 16-bit sampling rate, and downsample it to 16 kHz. We further process the audio from this wave format into standard mel-scale frequency cepstral coefficients, representing the audio signal as feature vectors of 13 values every 10 milliseconds.
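To make this extraction step concrete, the following sketch shows one way to carry it out, assuming the ffmpeg command-line tool and the Python librosa package are available; neither tool is prescribed by our system, and the file names and parameters are illustrative only.

import subprocess
import librosa

def extract_features(mpeg_path, wav_path="audio_16k.wav"):
    # Pull the audio stream out of the MPEG file and convert it to
    # 16-kHz, 16-bit, mono PCM (the target rate described above).
    subprocess.run(
        ["ffmpeg", "-y", "-i", mpeg_path, "-vn", "-ac", "1",
         "-ar", "16000", "-acodec", "pcm_s16le", wav_path],
        check=True,
    )
    # Compute 13 mel-frequency cepstral coefficients every 10 ms
    # (hop of 160 samples at 16 kHz) over a 20-ms analysis window.
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=320, hop_length=160)
    return mfcc.T  # one 13-dimensional feature vector per 10-ms frame

features = extract_features("news_broadcast.mpg")
print(features.shape)  # (number_of_frames, 13)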
We also obtain two sources of transcriptions for the speech: closed-captioned transcripts and
speech-recognizer-generated transcripts.
Closed-caption transcripts
Closed captions from television news programs are typed by human transcribers and broadcast
together with the video. While many programs are not captioned at all, for prime-time broadcast
news stories, the closed-captioned data contains an average 17% word-error rate (WER) when the captions are carefully compared to what was actually spoken. This error rate increases when you consider that large chunks of a broadcast are frequently not captioned at all, such as when commercials or previews of other programs appear. In addition to spelling errors, insertions, and
omissions, the captions are usually absent if text is visible on the screen. Occasionally the human
captioner will also fall behind and rephrase the speech to summarize what was spoken to catch up
with the real-time captioning. In addition, numbers and dates are often transcribed ambiguously
(for example, given the transcription 137.5, the speech could have been “one hundred thirty-seven
point five”, “one thirty-seven point five”, “one hundred thirty-seven and a half”, and so on).
A high WER in the transcription will likely corrupt the existing acoustic model, decreasing
rather than increasing the speech-recognition system’s accuracy. To avoid this, we want the
transcriptions to be as accurate as possible, more so than existing closed-caption transcripts.
Speech-recognizer hypotheses
A second independent source of transcription data is the initial output of a speech recognizer.
We use the lower-accuracy, but faster, Sphinx-II speech-recognition system to get a quick and
rough speech-recognizer transcript of the video data.6 Because a lot of data has to be processed,
recognizer speed is more important than quality in this pass. Using only the output of a speech
recognizer for transcribing speech data is inadequate, because these transcripts are full of errors.
The initial recognition transcripts for this broadcast news domain data range from
approximately 30% to 60% WER. We also want to avoid training the speech recognizer to repeat
the errors it initially made based on its original acoustic model.
Extracting accurately transcribed speech segments
To obtain a more accurate transcription of an audio stream, we first align the closed captions
and the speech recognizer’s output. The process of finding matching sequences of words in the
closed captions and the recognition output is rather straightforward: we perform a dynamic-programming alignment between the two text strings, with a distance metric between words that
is zero if they match exactly, one if they don’t.7
From the alignment, we select segments where sequences of three or more words are the same
in both the recognizer output and closed captions. Figure 1 shows a selection of the words “I
think there are” from the audio stream, using the associated speech recognition and closed
captions. This selection verifies the closed captions using a source independent of them, namely
the speech-recognition output. We view the matching word sequence of closed captions and the
speech recognizer’s hypothesis as a form of mutual confirmation or as a binary confidence
measure. We extract the corresponding audio segment for the selected annotation segment to use
as speech-sample training data. Once this method has found matching sections, extracting the corresponding speech signal and caption text from their respective files and adding them to the training data set is relatively simple.
Figure 1. Extracting an audio segment with a reliable transcription from the MPEG video using
the closed captions aligned with a rough speech-recognition transcript.
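The following sketch illustrates this alignment-and-selection step. It performs a word-level dynamic-programming alignment with the distance metric described above (zero for an exact match, one otherwise) and then extracts runs of three or more identical words; the helper names are ours, and the example words echo Figure 1.

def align(caption_words, hypothesis_words):
    n, m = len(caption_words), len(hypothesis_words)
    # cost[i][j] = minimum edit cost aligning the first i caption words
    # with the first j hypothesis words.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if caption_words[i - 1] == hypothesis_words[j - 1] else 1
            candidates = [(cost[i - 1][j - 1] + sub, "match" if sub == 0 else "sub"),
                          (cost[i - 1][j] + 1, "del"),
                          (cost[i][j - 1] + 1, "ins")]
            cost[i][j], back[i][j] = min(candidates)
    # Trace back to recover the aligned operation sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append((op, i - 1, j - 1))
        if op in ("match", "sub"):
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    ops.reverse()
    return ops

def matching_segments(caption_words, hypothesis_words, min_run=3):
    # Yield (start, end) caption-word indices of runs of >= min_run exact matches.
    run = []
    for op, ci, hj in align(caption_words, hypothesis_words):
        if op == "match":
            run.append(ci)
        else:
            if len(run) >= min_run:
                yield run[0], run[-1]
            run = []
    if len(run) >= min_run:
        yield run[0], run[-1]

caption = "I THINK THERE ARE MANY PROBLEMS".split()
hypothesis = "I THINK THERE ARE MEN AND PROBLEMS".split()
print(list(matching_segments(caption, hypothesis)))  # [(0, 3)] -> "I THINK THERE ARE"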
The word sequences function as the transcripts of the training corpus for supervised learning in
any training-data-dependent speech-recognition system, such as HMM or neural-network-based
systems.
Because the errors made by the captioning service and by Sphinx are largely independent, we
can be confident that extended sections over which the captions and the Sphinx transcript
correspond have been (mostly) correctly transcribed. A human evaluator checked the quality of the training segments obtained this way and judged them to have a WER of 8.6%. Our
initial training corpus comprised 709.8 hours of video, with associated closed captions. From this
training corpus, we extracted 18.6% (131.4 hours) of the data as usable transcriptions.
Acoustic model training
The acoustic-model training proceeds in three steps: forced alignment, codebook generation (gathering and clustering the feature vectors), and model reestimation using the Baum-Welch algorithm. Our Sphinx-III
speech-recognition system requires the acoustic models to be fully continuous HMMs.
To help explain the process of training HMM-based acoustic models, Figure 2 shows how
words are translated into phoneme sequences.3 A five-state HMM with predefined transitions
represents each phoneme. A 10-millisecond frame of observations of the speech signal triggers a
state transition. We compute the vector of observations at each frame: 39 features derived from the raw audio signal, sampled at 16 bits with 16,000 samples per second. We convert the raw samples into 12 mel-scale frequency cepstral values per frame. The cepstral values, the differenced cepstral values, and the difference of the differenced cepstral values, together with the power (energy of the signal), are represented as a vector of 39 observed features at each 10-millisecond
frame. Each HMM state is modeled through a mixture of 20 Gaussian densities (means and
variances). During training, we derive the Gaussian density means and variances that best model
each state. Because each triphone (a phoneme considered in the context of its left and right
phoneme neighbor) has five states, there are more possible states than we can reasonably train.
We limit the number of states (senones) to a total of 6,000, where different senones from different
phonemes can be tied together and modeled as only one of the 6,000 states or senones, if they are
sufficiently similar.
Figure 2. The hierarchy of HMM-based acoustic-model-training units.
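As a rough illustration of how the 39-dimensional observation vector can be assembled, the following sketch stacks the 12 cepstral values and the power term for each frame with their first and second differences. The simple frame-to-frame difference is used only for illustration; the exact delta computation in our trainer is not detailed here.

import numpy as np

def observation_vectors(cepstra, power):
    # cepstra: (frames, 12) array, power: (frames,) array -> (frames, 39) array.
    static = np.hstack([cepstra, power[:, None]])        # 13 static features
    delta = np.diff(static, axis=0, prepend=static[:1])  # 13 differenced features
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])   # 13 double-differenced features
    return np.hstack([static, delta, delta2])            # 39 features per frame

frames = 200
obs = observation_vectors(np.random.randn(frames, 12), np.random.randn(frames))
print(obs.shape)  # (200, 39)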
The acoustic-model-training procedure is identical to the configuration of the official 1997
Hub4 DARPA evaluation.8 We use the same 6,000 senonically tied states as the Hub4cmu model.
Each of these tied states consists of a mixture of 20 Gaussian densities. The senone trees are
automatically constructed from the F0, F1 subsets of the Hub4 1997 training data; that is, we
derive the senone trees from the clean studio speech and clean dialog speech data only.
We train the acoustic model on context-dependent phonemes (triphones). We map these into
6,000 senones and train them with our training segments. Because the Hub4 evaluation uses
separate models for telephone bandwidth speech (F2), we exclude this condition from our Hub4
training data and from our evaluation.
Forced alignment
Forced alignment serves two functions in preparing the data for the estimation of the Gaussian
densities. First, filled pauses are explicitly modeled as phonemes to prevent them from corrupting
the training of actual phonemes. Second, the forced-alignment stage segments the speech data
into state-level segments for the later Baum-Welch model-reestimation stage. The forced
alignment thus assigns each feature vector in the training data segment to a single state in the
HMM.
We force-align the transcript with its speech data to obtain a transcript that reflects the speech
data by inserting noise phones, filler words, or filled pauses between the content words. The
beginning and end of a training utterance are tokenized with a beginning-of-sentence silence <s>
and an end-of-sentence silence </s>. Interword silences are inserted as <sil>. In addition to
silence phones, noise phones such as +INHALE+, +SMACK+, +EH+, +UH+, or +UM+ are
inserted. Figure 3 shows an example of an original transcript and the insertion of nonspeech
sounds in the force-aligned transcript.
original transcript: I THINK THERE ARE
force-aligned transcript: <s> I ++SMACK++ THINK <sil> THERE ARE ++INHALE++ </s>
Figure 3. A transcript before and after forced alignment. The nonspeech sounds <sil> (silence),
++SMACK++ (indicating a lip smack), and ++INHALE++ (indicating a breath noise) are
inserted in the original transcript.
Forced alignment also has a (limited) capability to reject training utterances where the feature
vectors cannot be adequately aligned to a sequence of phonetic states. This would happen, for
example, if the number of phonemes in the training transcript is greater than the number of
phonemes actually contained in the audio feature vector or if the acoustic models completely
mismatch the feature vectors.
Because of this rejection, the amount of training data from the baseline LDC Hub4 data
decreased from 66 hours to 62.8 hours. The automatically extracted CCtrain data decreased from
131.4 hours to 111.5 hours. The data yield from the original 709.8 hours was therefore
approximately 15.6%. All together, the amount of usable data totaled 174.3 hours available for
further training.
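As an illustration of this kind of rejection, the following sketch checks a necessary condition for alignment: with five emitting states per phoneme and at least one 10-millisecond frame consumed per state (assuming a strict left-to-right topology without skip transitions, which is a simplification we make only for the example), a segment whose expanded phoneme sequence requires more frames than the audio provides cannot be force-aligned and is dropped.

def can_force_align(num_phonemes, num_frames, states_per_phone=5):
    # Each state must consume at least one 10-ms frame, so the transcript's
    # expanded state sequence cannot be longer than the audio itself.
    return num_frames >= num_phonemes * states_per_phone

# A 3-second segment (300 frames) with a 70-phoneme transcript would be rejected.
print(can_force_align(num_phonemes=70, num_frames=300))  # False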
Codebook generation
This process takes two steps: gathering the feature vectors together and clustering them into
different densities.
During the gathering step, we aggregate all the feature vectors of all the frames in the training
data. Each feature vector covers a 20-millisecond window with a 10-millisecond overlap and consists of 39 features: 12 mel-scale frequency cepstral coefficients, 12 delta (differenced) mel-scale cepstral coefficients, 12 delta-delta (twice-differenced) mel-scale cepstral coefficients, and three corresponding power features.
The clustering step uses K-means clustering to partition the training vectors into different
Gaussian densities. Each density is represented by its mean and variance in the 39-dimensional
vector spaces.
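The following sketch illustrates the codebook step using scikit-learn's KMeans as a stand-in clustering tool (the toolkit is not part of our system): each cluster of gathered 39-dimensional frame vectors is summarized by the mean and variance of a diagonal Gaussian density.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(frames, n_densities=20):
    # frames: (num_frames, 39) array of gathered observation vectors.
    kmeans = KMeans(n_clusters=n_densities, n_init=10, random_state=0).fit(frames)
    means, variances = [], []
    for k in range(n_densities):
        members = frames[kmeans.labels_ == k]
        means.append(members.mean(axis=0))
        variances.append(members.var(axis=0) + 1e-6)  # floor to avoid zero variance
    return np.array(means), np.array(variances)

frames = np.random.randn(5000, 39)      # placeholder for gathered training frames
means, variances = build_codebook(frames)
print(means.shape, variances.shape)      # (20, 39) (20, 39)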
Model reestimation
As is typical for HMM-based speech-recognition systems, we use the Baum-Welch algorithm
(also known as the forward-backward algorithm) to model the observations in the training data
through the HMM parameters. This algorithm is a kind of EM (expectation maximization)
algorithm that iterates through the data first in a forward pass and then in a backward pass.
During each pass, we adjust a set of probabilities to maximize the probability of a given
observation in the training data corresponding to a given HMM state. Because this estimation
problem has no analytical solution, incremental iterations are necessary until convergence is achieved. In each iteration, the algorithm tries to find better probabilities that maximize the
likelihood of observations and training data. During this phase, we reestimate the mixing weight,
transition probabilities, and mean and variance parameters.
After each Baum-Welch reestimation iteration, we insert a normalization step. We compute the
reestimated model parameters from the reestimation counts obtained through Baum-Welch. This
combined Baum-Welch and normalization iteration repeats until we achieve an acceptable
parameter convergence.
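For readers who want to experiment, the following sketch runs Baum-Welch reestimation with the hmmlearn package as a stand-in trainer. The tiny five-state, two-mixture model and the random data are placeholders only; our actual system trains 6,000 tied states with 20 Gaussian densities each using the Sphinx-III trainer.

import numpy as np
from hmmlearn import hmm

# Two "utterances" of 39-dimensional feature vectors, concatenated as hmmlearn expects.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(120, 39)), rng.normal(size=(90, 39))]
X = np.concatenate(utterances)
lengths = [len(u) for u in utterances]

model = hmm.GMMHMM(n_components=5,       # five HMM states, as in our phone models
                   n_mix=2,              # mixture densities per state (20 in our system)
                   covariance_type="diag",
                   n_iter=10,            # Baum-Welch (EM) iterations
                   random_state=0)
model.fit(X, lengths)                    # forward-backward reestimation of all parameters
print(model.monitor_.converged, model.transmat_.shape)  # convergence flag, (5, 5)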
Speech decoding
We decode speech with a Viterbi search, which finds the state sequence that has the highest
probability of being taken while observing the features of the speech sequence. Because this
experiment’s purpose was to evaluate the new acoustic model’s effectiveness, we perform only a
single search pass.
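The following compact, log-domain Viterbi sketch shows the core of such a search over a generic HMM; a real decoder adds the pronunciation lexicon, language-model scores, and beam pruning, which are omitted here for clarity.

import numpy as np

def viterbi(log_init, log_trans, log_obs):
    # log_init: (S,), log_trans: (S, S), log_obs: (T, S) -> most probable state path.
    T, S = log_obs.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_obs[0]
    for t in range(1, T):
        # For each state, keep only the single best predecessor (max, not sum).
        candidates = score[t - 1][:, None] + log_trans   # (S, S)
        back[t] = candidates.argmax(axis=0)
        score[t] = candidates.max(axis=0) + log_obs[t]
    # Trace back from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 3-state example with 6 observation frames.
rng = np.random.default_rng(1)
path = viterbi(np.log(np.full(3, 1 / 3)),
               np.log(rng.dirichlet(np.ones(3), size=3)),
               np.log(rng.dirichlet(np.ones(3), size=6)))
print(path)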
To get optimal performance from the system, we can use multiple passes where we perform
maximum-likelihood linear regression between passes for speaker adaptation. The Hub4
evaluation system consists of three such decoding passes with acoustic adaptation steps between
each pass. For each pass, we apply an A* beam search for the Viterbi decoding pass and a
directed acyclic graph (DAG) search for the best path pass of the Viterbi word lattice. Finally, we
generate and rescore N-best hypotheses. Our baseline CMU Hub4 system has an overall WER of
24% when all passes are applied.8 Because the test data in our experiments comes from a
different set, and because we do not perform multiple adaptation passes, the Hub4 CMU Sphinx-III system’s baseline WER is 32.82%.
Language model
In addition to the acoustic model, the decoding system uses a language model that reflects the
likelihood of different word sequences. We used a Good-Turing discounted trigram back-off
language model. It is trained on broadcast-news language-model training data and built using a
64,000-word vocabulary.8
By increasing the language model’s contribution relative to the acoustic model, we make the
system less sensitive to acoustic errors and more sensitive to word-transition probabilities. To get
increased benefit from the language model, we can apply an additional N-best rescoring pass. In
general, recognition accuracy is highest with N-best rescoring, followed by the DAG best-path search, and then the single 1-best Viterbi decoding. Multipass decoding might
provide higher accuracy with contributions by a better search through the hypothesis lattice and
with smoothed language models.
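The following minimal sketch shows the standard log-domain combination that the language-model weight controls; the scores are hypothetical, and the exact scoring details of our decoder are not reproduced here.

def combined_score(acoustic_logprob, lm_logprob, lm_weight=6.0):
    # Hypotheses are ranked by acoustic log-likelihood plus the weighted
    # language-model log-probability; a larger weight favors the language model.
    return acoustic_logprob + lm_weight * lm_logprob

# Hypothetical scores for two competing hypotheses of the same utterance.
hyp_a = combined_score(acoustic_logprob=-120.0, lm_logprob=-18.0)  # acoustically better
hyp_b = combined_score(acoustic_logprob=-125.0, lm_logprob=-15.0)  # linguistically better
print("A" if hyp_a > hyp_b else "B")  # with lm_weight=6, hypothesis B wins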
Evaluation results
In the following experiments, we decoded our test speech data with only a single-pass Viterbi
search and without maximum-likelihood linear-regression acoustic adaptation. We did this
because we intended to estimate the improvement achieved on the acoustic model compared to
the baseline model, where single-pass decoding is sufficiently indicative.
Consistent with standard practice in speech evaluations, the recognition word error is not computed simply as the number of misrecognized (substituted) words; it also includes insertions and deletions. The error rate is this error count normalized by the number of words in the test set. So, word error = insertions + deletions + substitutions, and WER = word error / total number of words in the test set.
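The following self-contained sketch computes this measure from a word-level edit-distance alignment; the function name and the example strings are illustrative only.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # substitution or match
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I THINK THERE ARE", "I THINK THEIR"))  # 0.5 (1 sub + 1 del over 4 words)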
The test data for our results came from the 1996 DARPA Hub4 broadcast news development
data (Dev’96 test set).9 It consists of 409 utterances, comprising 16,456 words with a total duration of 1.5 hours. Figure 4 details the Dev’96 test set.
F0 : Baseline broadcast speech
F1 : Spontaneous broadcast speech
F2 : Speech over telephone channels
F3 : Speech in the presence of background music
F4 : Speech under degraded acoustic conditions
F5 : Speech from nonnative speakers
Fx : All other speech
Figure 4. Hub4 acoustic conditions. The 7 conditions (F0, F1, F2, F3, F4, F5, Fx) represent
different kinds of speech in different acoustic environments.
We use different acoustic models suitable for reduced bandwidth speech for the telephone (F2)
conditions. We decided to exclude the F2 condition and only evaluate the nontelephone (F0, F1,
F3, F4, F5) conditions because we are improving a full-bandwidth nontelephone mixture
Gaussian model, not training a narrow-bandwidth telephone mixture Gaussian model.
The different acoustic categories in our Dev’96 test set were not evenly distributed. The
number of words in each category varied widely—out of 16,456 words, the F0 condition
contained 4,437 words; F1, 5,706; F3, 1,809; F4, 2,697; and F5, 1,807.
Figure 5 shows the overall result statistics (including all acoustic conditions together) on the Dev’96 test set. The figure covers both the baseline Hub4cmu system and our improved system (CCtrain, so named because we train with closed-captioned (CC) data), which we built by estimating the acoustic-model parameters from both the 62.8 hours of Hub4 training data and the 111.5 hours of our automatically derived training corpus. We based this corpus on closed-caption data and initial recognition transcripts. For all three language-model weights (3, 6, and 9), our CCtrain acoustic model achieved improvements over our baseline system; the model used a total of 174.3 hours of training data.
Figure 5. Recognition word-error rate for the Hub4cmu-trained acoustic baseline models
compared to the CCtrain models for the Dev’96 test set, with three different language-model
weights.
Figure 6 depicts the absolute percentage decrease in WER, again with different language-model weights, this time 6, 9, and 13.
Figure 6. The absolute difference in word-error rate between the CCtrain system (62.8 hours of
manual transcripts + 111.5 hours of automatically extracted transcripts = 174.3 hours) and the
Hub4cmu baseline system (62.8 hours of manual transcripts only) using three different language-model weights.
Higher language-model weights make it harder to estimate the degree of acoustic-model
improvement. This is due to the language model’s increasing contribution to speech-recognition
accuracy during decoding. Figures 5 and 6 show that adding our CCtrain corpus to the manually
transcribed Hub4 training data improves the overall speech-recognition accuracy.
Figure 7 shows the WER for the individual acoustic conditions, contrasting the baseline
Hub4cmu system with our CCtrain adapted system. The CCtrain model provides superior results
on F0, F3, F4, and F5. For F1, its performance is comparable. The different acoustic conditions
result in widely varying WERs, reflecting the difficulty of the condition for speech recognition as
well as the varying amounts of training data in each condition. For this and all subsequent figures,
we set the language-model weight to six to demonstrate the acoustic component’s effect, with the
language model providing relatively little influence.
Figure 7. CCtrain versus the Hub4cmu baseline recognition word-error rate for each acoustic
condition: F0 = clean broadcast speech, F1 = spontaneous speech, F3 = speech with background
music, F4 = speech under degraded acoustic conditions, and F5 = speech from nonnative
speakers.
To illustrate the transcription-extraction approach’s power, we built another acoustic model
exclusively from the extracted training data (CCtrain-only), without any of the manually
transcribed Hub4 training data. To emphasize the difference between human-transcribed data and
automatically extracted data, we also limited the training to 70.7 hours of transcribed data. Thus,
the CCtrain-only automatically extracted data condition is comparable to the Hub4-only
condition. Figure 8 shows that the CCtrain-only condition performs slightly better than the Hub4-only condition. We also found that an expanded CCtrain model (CCtrain + Hub4) outperforms
our baseline Hub4cmu system. In the expanded CCtrain condition we are merely increasing the
amount of training data to 174.3 hours by adding automatically extracted data to the manually
transcribed Hub4 training data without any other adjustments.
Figure 8. Word-error rate of CCtrain built from 174.3 hours of data consisting of our extracted
segments plus Hub4 data, Hub4cmu built from 62.8 hours of Hub4 data, and CCtrain built from
70.7 hours of extracted segments only. The word-error rates are shown for each
acoustic condition: F0 = clean broadcast speech, F1 = spontaneous speech, F3 = speech with
background music, F4 = speech under degraded acoustic conditions, and F5 = speech from
nonnative speakers.
Error analysis
To analyze the effect of the additional training data in CCtrain on the recognition performance,
we examined the word errors in the test set with respect to the number of training instances in the
Hub4 and the CCtrain training data.
In Figure 9, we only examined words that were present in both the training corpus and the Dev’96 test data. The x-axis represents the number of times a word was present in the training
set, and the y-axis is the number of errors for that same word in the Dev’96 test set. Each word is
represented by two points with a line connecting them. The first point represents the number of
training instances for the word in the Hub4 training corpus and the number of errors for this word
during the Hub4cmu baseline recognition. The second point is based on the CCtrain corpus. The
CCtrain point is determined by the number of training instances in the CCtrain corpus on the x-axis and the number of errors for that word during the CCtrain decode on the y-axis. Because the
CCtrain training corpus included all the Hub4 words as well as new training material, the CCtrain
point is always on the right end of a line, while the Hub4cmu point is on the left end. We divided
the data into two sets of words, those words where the errors decreased with more training data
and those where the errors increased after adding the additional CCtrain data. The descending (gray) lines show the words recognized with fewer errors by the CCtrain model, while the ascending (bold) lines show the words recognized less accurately after CCtrain.
Figure 9. The effect of word frequency in the different training corpora on the number of errors.
Each word in the test data is plotted based on its errors in the Hub4cmu baseline recognition
given its frequency during Hub4 training (the left point of each line) compared to the CCtrain
frequency and recognition errors for that word (the right point of each line). An ascending line
shows an increase in the error rate for a word (17.4% of the Dev’96 test-set words), while a
descending line reflects a decrease in error rate by the CCtrain model (82.6% of the Dev’96 test-set words).
Figure 10 plots the same data. Instead of plotting both the Hub4cmu baseline word-training
frequency and total errors together with the CCtrain training frequency and errors, we subtract the
CCtrain word errors from the Hub4cmu errors, and we use the difference in training frequencies
between CCtrain and Hub4cmu on the x-axis. Each data point in Figure 10 now represents the
increase in word training frequency for the CCtrain corpus on the x-axis and the difference in
word errors on the y-axis between the two conditions. If the word error does not change, the
difference in errors is 0. If the Hub4cmu word errors were lower than the CCtrain error rate, the
data values in Figure 10 will lie above the 0 point on the y-axis. If the CCtrain showed an
improvement for that word, the point will be plotted below 0 on the y-axis. Examining Figure 10
shows that the word errors usually decrease when we increase the number of training instances by
2,000 or more.
Figure 10. A plot of the difference in errors for each word between CCtrain and Hub4cmu model
recognition in the Dev’96 test data (y-axis). The x-axis shows the increase in training-word
frequency for CCtrain over Hub4cmu for a given word. Increases in word error for CCtrain are
positive on the y-axis (17.4% of all words), while decreases are negative (82.6%).
Figure 11 shows all the words in the test data, plotted as a function of their frequency in the
training data. To show the trend of the individual points, a line represents the average WER in
histogram bins, where each bin combines 30 adjacent frequencies.
Figure 11. The word-error rate in the CCtrain condition for each word in the Dev’96 test set,
plotted as a function of the number of training instances for that word. The trend line shows the
average error rate for words in a histogram bin of 30 similarly frequent words. Beyond about 2,000 training instances, additional instances no longer help reduce the word-error rate; up to that point, each additional training instance of a word helps reduce its error rate.
Because we are training HMMs based on triphones, not words, we also analyzed the behavior
of the triphones with respect to the number of training instances observed. As we described
earlier, a triphone is a context-dependent phoneme where its left and right phoneme context is
taken into account. A context-independent phoneme does not carry its context information. For
example, the phoneme IH in the context TH IH NG (as in the word something) is a triphone IH where TH is its left phoneme context and NG is its right phoneme context.
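The following small sketch expands a phoneme sequence into such context-dependent triphones; treating the utterance edges as a silence context is an assumption made only for the example.

def triphones(phonemes, boundary="SIL"):
    # Pad the sequence with a boundary context and emit (left, center, right) triples.
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]

# "something" (rough phoneme sequence): S AH M TH IH NG
for left, center, right in triphones(["S", "AH", "M", "TH", "IH", "NG"]):
    print(f"{center}({left},{right})")
# ... IH(TH,NG) appears as the triphone from the example above.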
If the corpora used to train the acoustic model contain little or no noise, we observe words or triphones in the training set more often, and we would therefore expect better acoustic models.
Predicting informative triphone subsets
The analysis of our test data indicated that training on about 1,000 instances of a triphone or
2,000 instances of each word would be sufficient to obtain acceptable recognition performance.
To verify this hypothesis, we conducted an experiment with selective training of rare or
underrepresented triphones or words, by excluding redundant training data that had no instances
of triphones below the threshold.
We hypothesized a threshold, set to 1,000, on the triphone frequency where the triphone error
rate started to closely approximate its asymptote. We extracted a subset of our automatically transcribed speech-training data by adding a new segment to the subset only if the
frequency of any triphone within the segment was below our hypothesized threshold. We trained
an acoustic model on this training subset to see whether the hypothesized threshold near the
asymptote on our plot is a good indicator of the number of triphone observations required.
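The following sketch captures our reading of this selection pass: segments are scanned in order and added to the subset only while they still contribute at least one triphone whose running count is below the threshold.

from collections import Counter

def select_segments(segments, threshold=1000):
    # segments: list of lists of triphones (one list per training segment).
    counts = Counter()
    selected = []
    for segment in segments:
        # Keep the segment only if it still adds an underrepresented triphone.
        if any(counts[t] < threshold for t in segment):
            selected.append(segment)
            counts.update(segment)
    return selected

# Toy example with a threshold of 2: the third segment's only triphone
# already meets the threshold, so it is skipped.
toy = [["a"], ["a"], ["a"], ["b"]]
print(len(select_segments(toy, threshold=2)))  # 3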
Figure 12 shows the results of single-pass decodings. Our selected training subset of 66.7
hours, which included only approximately 60% of the original training set of 111.5 hours, yielded
a relative WER decrease of 6.76% (2.22% absolute) compared to the Sphinx baseline system built from 62.8
hours of Hub4’97 training data. However, our selection technique resulted in a WER increase of
0.25% compared to the complete 111.5 hours of the CCtrain data set.
Figure 12. Word-error rate comparison of models built from three nonselected sets (62.8 hours
Hub4 baseline, 111.5 hours CCtrain, and 174.3 hours combined) and one selected set (66.7 hours
selected from 111.5 hours of CCtrain data).
If we take only the 111.5 hours of unselected CCtrain data, we obtain a WER that is 0.8% lower than when this CCtrain data is combined with the Hub4’97 manually transcribed data, yielding a total of 174 hours. However, one would expect more training data to result in a better WER. We
believe this lower WER on less training data is because our automatic extraction technique
produces more accurate transcriptions than those found in the manual Hub4’97 training data. It
might also appear contradictory to first extract an arbitrarily large amount of transcribed speech
and then to reduce that set into a smaller subset, but selecting a subset from the original, larger
amount of speech increases the probability of incorporating more diverse triphones.
Although the training utterances derived from TV broadcasts, rough speech-recognition
transcripts, and closed captions are not completely error-free, a potentially unlimited amount of
this data is freely available from TV broadcasts 24 hours a day. The automated procedures for
deriving accurate transcriptions allow virtually unlimited amounts of data to be processed, limited
only by CPU cycles and storage space. The large quantity of data compensated for the remaining errors in the training-data transcripts.
Selection from collected speech data based on triphone frequency thresholding reduces training
costs, results in better decoding performance than when a similar amount of unselected manually
transcribed speech is used, and performs comparable to much larger amounts of speech training
sets.
One possible criticism of our scheme for learning acoustic-model parameters is that the
approach can only identify sections of speech on which the recognizer already performs well.
However, a dynamically constructed language model based on the captioned words, together with
a high language-model preference in the recognizer’s first pass, can diminish this bias. The
identified sections of speech have sufficient variability to provide useful training data, but
eventually an asymptotic plateau might be reached.
We are also working on improving the quality of training corpora with confidence annotation.
Extracting segments based on confidence, as Zavaliagkos and Colthurst described,1 would be an alternative to selecting the conjunction of closed captions and the speech recognizer’s output.
Acknowledgments
This article is based on work supported by the National Science Foundation under Cooperative
Agreement IIS-9817496. Any opinions, findings, conclusions, or recommendations expressed in this
material are those of the authors and do not necessarily reflect the views of the NSF.
References
1. G. Zavaliagkos and T. Colthurst, “Utilizing Untranscribed Training Data to Improve Performance,” Proc. DARPA Speech Recognition Workshop, 1998, pp. 301–305.
2. M. Witbrock and A.G. Hauptmann, Improving Acoustic Models by Watching Television, Tech. Report CMU-CS-98-110, Computer Science Dept., Carnegie Mellon Univ., Pittsburgh, Pa., 1998.
3. K.F. Lee, H.-W. Hon, and R. Reddy, “An Overview of the Sphinx Speech Recognition System,” Readings in Speech Recognition, Morgan Kaufmann, San Francisco, 1990.
4. A. Rudnicky, “Language Modeling with Limited Domain Data,” Proc. ARPA Workshop on Spoken Language Technology, 1996.
5. H. Bourlard, H. Hermansky, and N. Morgan, “Copernicus and the ASR Challenge—Waiting for Kepler,” Proc. DARPA Speech Recognition Workshop, 1996.
6. M. Hwang et al., “Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in Sphinx-II,” Proc. Int’l Conf. Acoustics, Speech and Signal Processing, Vol. I, 1994, pp. 549–552.
7. H. Ney, “The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition,” IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 2, Apr. 1984, pp. 263–271.
8. K. Seymore et al., “The 1997 CMU Sphinx-3 English Broadcast News Transcription System,” Proc. DARPA Speech Recognition Workshop, 1998.
9. D. Pallett, “Overview of the 1997 DARPA Speech Recognition Workshop,” Proc. DARPA Speech Recognition Workshop, 1997.
The Informedia Digital Video Library
Researchers working on the Informedia Digital Video Library Project at Carnegie Mellon
University are creating a digital library of text, image, video, and audio data, whose entire content
can be searched for rapid retrieval of material relevant to a user query.1 Through the integration
of technologies from natural language understanding, image processing, speech recognition, and
information retrieval, the Informedia Digital Video Library system lets users explore the
multimedia data both in depth and in breadth. Speech recognition is a critical component in the
automated library-creation process.2,3 Users can employ automatic speech recognition to
transcribe the audio portion of the video data stored in MPEG format in the library. This speech-recognizer-generated transcript forms the basis for the main text search and retrieval functions.
The transcripts also let the natural-language component proceed with functions such as
summarizing the library documents, classifying them into topics, translating documents, and
creating descriptive titles.
References
1. M. Christel et al., “Techniques for the Creation and Exploration of Digital Video Libraries,” Multimedia Tools and Applications, Vol. 2, Borko Furht, ed., Kluwer Academic Press, Boston, 1996.
2. A.G. Hauptmann and H.D. Wactlar, “Indexing and Search of Multimodal Information,” Proc. Int’l Conf. Acoustics, Speech and Signal Processing, 1997.
3. M.J. Witbrock and A.G. Hauptmann, “Artificial Intelligence Techniques in a Digital Video Library,” J. Amer. Soc. for Information Science, 1997.