Learning to Recognize Speech by Watching Television
Photina Jaeyun Jang and Alexander G. Hauptmann, Carnegie Mellon University
Because intelligent systems should be able to automatically learn by watching examples, the
authors present a way to use television broadcasts with closed-captioned text as a source for large
amounts of transcribed speech-training data.
Imagine a computer plugged into your television at home or in a foreign hotel. In the morning, it
can barely transcribe one out of two words correctly, but by evening, it can provide a largely
correct transcript of the evening news show, based on what it learned during the day from the TV.
This seems like the core of familiar science fiction shows, but the research described here brings
this vision closer to reality by exploiting the multimodal nature of the broadcast signals and the
variety of sources available for broadcast data.
Our proposed technique gathers large amounts of speech from open broadcast sources and combines it with automatically obtained text or closed captioning to identify suitable speech-training material. George Zavaliagkos and Thomas Colthurst pursued a different approach, which uses confidence scoring on the acoustic data itself to improve performance in the absence of any transcribed data, but their approach yielded only marginal results.1 Our initial efforts also provided only limited success with small amounts of data.2
In this article, we describe our approach to collecting almost unlimited amounts of accurately
transcribed speech data. This information serves as training data for the acoustic-model
component of most high-accuracy speaker-independent speech-recognition systems. We align the error-ridden closed-captioned text with the similarly error-ridden speech-recognizer output. We
assume matching segments of sufficient length are reliable transcriptions of the corresponding
speech. We then use these segments as the training data for an improved speech recognizer.
Problems with speech recognition
A speech-recognition system generally uses three components:3 an acoustic model, a language
model, and a lexicon of pronunciations. Here, we only discuss using broadcast news data from
TV sources to improve acoustic models by combining them with imperfect closed captions.
Although the closed captions can also be used to build a language model, this process is
straightforward and will not be discussed here.4
Artificial neural networks, hidden Markov models (HMM), or hybrid systems of both are the
dominant acoustic models of speech-recognition systems. Researchers have extensively tuned the
architecture of these acoustic models, so achieving large recognition improvements is difficult.
Ideally, a good acoustic model would be sufficient for a complete speech-recognition system. But techniques outside the acoustic domain, such as language modeling, confidence annotation, robust signal processing, and postprocessing, have been necessary to aid acoustic models and play an increasingly important role in better speech recognition. Because the architecture of these acoustic models is already fairly well optimized, yet the models can still improve through training on large amounts of speech data, the emphasis has shifted to collecting manually transcribed speech data.
This transcribed data provides a training corpus for the acoustic model. With sufficient coverage
in the training corpus, which ideally includes typical instances of all representative speech
patterns that might be encountered during recognition, the acoustic model should cope with any
general speech patterns encountered in the future. Thus, it is important to have as much
representative transcribed speech-training data as possible to train the acoustic models, to
maximize the speech-recognition system’s performance.5
Besides extensive speech-data collection, we face the bottleneck of manually annotating or
transcribing the collected speech corpora. Supervised learning systems such as neural networks or
HMM systems require a highly accurate transcription of the training data. Speech-data
transcription relies heavily on humans listening to the speech and manually writing down what
was said, one sentence at a time. This manual transcription is very tedious, expensive, and still
subject to human error. The largest currently available collection of accurately transcribed
broadcast data consists of slightly more than 240 hours of broadcast news provided by the
Linguistic Data Consortium (http://ldc.upenn.edu), but this data is still not error-free.
We present a method for solving both the problems of expensive speech-data collection and
expensive human annotation of speech. Every day new speech data becomes available on
television, along with human-transcribed closed captions. The challenge is to use this unreliable
data for speech-recognizer training.
Sources of speech and preliminary transcriptions
The Informedia Digital Video Library records about an hour and a half of TV broadcasts every
day and stores them in MPEG format (see the “Informedia Digital Video Library” sidebar for
more information). If closed captioning is available, the closed-captioned data is stored in a
separate text file. To date, the library has over 1,700 hours of video. Although we could access
broadcast TV directly 24 hours a day to extract our data, we restricted ourselves to a portion of
the video archive in the Informedia Digital Video Library for simplicity and repeatability. For the
experiments reported here, we used 709 hours of initial raw data from the library.
For comparison, we also examined 100 hours of manually transcribed training data provided by
the Linguistic Data Consortium, which we used as a baseline contrast as well as seed data in our
experiments. After eliminating silences and untranscribed sections, we reduced the LDC Hub4
data to 66 hours of transcribed speech.
Extracting audio samples
From each stored MPEG file, we extract the MPEG audio stream, uncompress it into its
original 44.1-kHz, 16-bit sampling rate, and downsample it to 16 kHz. We further process the audio from this wave format into standard mel-scale frequency cepstral coefficients, representing the audio signal as feature vectors of 13 values every 10 milliseconds.
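To make this extraction step concrete, the following sketch shows one way to carry it out, assuming the ffmpeg command-line tool and the Python librosa package are available; neither tool is prescribed by our system, and the file names and parameters are illustrative only.

import subprocess
import librosa

def extract_features(mpeg_path, wav_path="audio_16k.wav"):
    # Pull the audio stream out of the MPEG file and convert it to
    # 16-kHz, 16-bit, mono PCM (the target rate described above).
    subprocess.run(
        ["ffmpeg", "-y", "-i", mpeg_path, "-vn", "-ac", "1",
         "-ar", "16000", "-acodec", "pcm_s16le", wav_path],
        check=True,
    )
    # Compute 13 mel-frequency cepstral coefficients every 10 ms
    # (hop of 160 samples at 16 kHz) over a 20-ms analysis window.
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=320, hop_length=160)
    return mfcc.T  # one 13-dimensional feature vector per 10-ms frame

features = extract_features("news_broadcast.mpg")
print(features.shape)  # (number_of_frames, 13)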
We also obtain two sources of transcriptions for the speech: closed-captioned transcripts and
speech-recognizer-generated transcripts.
Closed-caption transcripts
Closed captions from television news programs are typed by human transcribers and broadcast
together with the video. While many programs are not captioned at all, for prime-time broadcast
news stories, the closed-captioned data contains an average 17% word-error rate (WER) when the captions are carefully compared to what was actually spoken. This error rate increases when you consider that large chunks of a broadcast are frequently not captioned at all, such as when commercials or previews of other programs appear. In addition to spelling errors, insertions, and
omissions, the captions are usually absent if text is visible on the screen. Occasionally the human
captioner will also fall behind and rephrase the speech to summarize what was spoken to catch up
with the real-time captioning. In addition, numbers and dates are often transcribed ambiguously
(for example, given the transcription 137.5, the speech could have been “one hundred thirty-seven
point five”, “one thirty-seven point five”, “one hundred thirty-seven and a half”, and so on).
A high WER in the transcription will likely corrupt the existing acoustic model, decreasing
rather than increasing the speech-recognition system’s accuracy. To avoid this, we want the
transcriptions to be as accurate as possible, more so than existing closed-caption transcripts.
Speech-recognizer hypotheses
A second independent source of transcription data is the initial output of a speech recognizer.
We use the lower-accuracy, but faster, Sphinx-II speech-recognition system to get a quick and
rough speech-recognizer transcript of the video data.6 Because a lot of data has to be processed,
recognizer speed is more important than quality in this pass. Using only the output of a speech
recognizer for transcribing speech data is inadequate, because these transcripts are full of errors.
The initial recognition transcripts for this broadcast news domain data range from
approximately 30% to 60% WER. We also want to avoid training the speech recognizer to repeat
the errors it initially made based on its original acoustic model.
Extracting accurately transcribed speech segments
To obtain a more accurate transcription of an audio stream, we first align the closed captions
and the speech recognizer’s output. The process of finding matching sequences of words in the
closed captions and the recognition output is rather straightforward: we perform a dynamic-programming alignment between the two text strings, with a distance metric between words that
is zero if they match exactly, one if they don’t.7
From the alignment, we select segments where sequences of three or more words are the same
in both the recognizer output and closed captions. Figure 1 shows a selection of the words “I
think there are” from the audio stream, using the associated speech recognition and closed
captions. This selection verifies the closed captions using a source independent of them, namely
the speech-recognition output. We view the matching word sequence of closed captions and the
speech recognizer’s hypothesis as a form of mutual confirmation or as a binary confidence
measure. We extract the corresponding audio segment for the selected annotation segment to use
as speech-sample training data. Once this method has found matching sections, extracting the corresponding speech signal and caption text from their respective files and adding them to the training data set is relatively simple.
Figure 1. Extracting an audio segment with a reliable transcription from the MPEG video using
the closed captions aligned with a rough speech-recognition transcript.
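The following sketch illustrates this alignment-and-selection step. It performs a word-level dynamic-programming alignment with the distance metric described above (zero for an exact match, one otherwise) and then extracts runs of three or more identical words; the helper names are ours, and the example words echo Figure 1.

def align(caption_words, hypothesis_words):
    n, m = len(caption_words), len(hypothesis_words)
    # cost[i][j] = minimum edit cost aligning the first i caption words
    # with the first j hypothesis words.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if caption_words[i - 1] == hypothesis_words[j - 1] else 1
            candidates = [(cost[i - 1][j - 1] + sub, "match" if sub == 0 else "sub"),
                          (cost[i - 1][j] + 1, "del"),
                          (cost[i][j - 1] + 1, "ins")]
            cost[i][j], back[i][j] = min(candidates)
    # Trace back to recover the aligned operation sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append((op, i - 1, j - 1))
        if op in ("match", "sub"):
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    ops.reverse()
    return ops

def matching_segments(caption_words, hypothesis_words, min_run=3):
    # Yield (start, end) caption-word indices of runs of >= min_run exact matches.
    run = []
    for op, ci, hj in align(caption_words, hypothesis_words):
        if op == "match":
            run.append(ci)
        else:
            if len(run) >= min_run:
                yield run[0], run[-1]
            run = []
    if len(run) >= min_run:
        yield run[0], run[-1]

caption = "I THINK THERE ARE MANY PROBLEMS".split()
hypothesis = "I THINK THERE ARE MEN AND PROBLEMS".split()
print(list(matching_segments(caption, hypothesis)))  # [(0, 3)] -> "I THINK THERE ARE"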
The word sequences function as the transcripts of the training corpus for supervised learning in
any training-data-dependent speech-recognition system, such as HMM or neural-network-based
systems.
Because the errors made by the captioning service and by Sphinx are largely independent, we
can be confident that extended sections over which the captions and the Sphinx transcript
correspond have been (mostly) correctly transcribed. A human evaluator checked the quality of the training segments obtained this way and judged them to have a WER of 8.6%. Our
initial training corpus comprised 709.8 hours of video, with associated closed captions. From this
training corpus, we extracted 18.6% (131.4 hours) of the data as usable transcriptions.
Acoustic model training
The acoustic-model training proceeds in three steps: forced alignment, codebook generation (gathering and clustering the feature vectors), and model reestimation using the Baum-Welch algorithm. Our Sphinx-III
speech-recognition system requires the acoustic models to be fully continuous HMMs.
To help explain the process of training HMM-based acoustic models, Figure 2 shows how
words are translated into phoneme sequences.3 A five-state HMM with predefined transitions
represents each phoneme. A 10-millisecond frame of observations of the speech signal triggers a
state transition. We compute the vector of observations at each frame: 39 features derived from the raw audio signal, sampled at 16 bits with 16,000 samples per second. We convert the raw samples into 12 mel-scale frequency cepstral values per frame. The cepstral values, the differenced cepstral values, and the difference of the differenced cepstral values, together with the power (energy of the signal), are represented as a vector of 39 observed features at each 10-millisecond
frame. Each HMM state is modeled through a mixture of 20 Gaussian densities (means and
variances). During training, we derive the Gaussian density means and variances that best model
each state. Because each triphone (a phoneme considered in the context of its left and right
phoneme neighbor) has five states, there are more possible states than we can reasonably train.
We limit the number of states (senones) to a total of 6,000, where different senones from different
phonemes can be tied together and modeled as only one of the 6,000 states or senones, if they are
sufficiently similar.
Figure 2. The hierarchy of HMM-based acoustic-model-training units.
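As a rough illustration of how the 39-dimensional observation vector can be assembled, the following sketch stacks the 12 cepstral values and the power term for each frame with their first and second differences. The simple frame-to-frame difference is used only for illustration; the exact delta computation in our trainer is not detailed here.

import numpy as np

def observation_vectors(cepstra, power):
    # cepstra: (frames, 12) array, power: (frames,) array -> (frames, 39) array.
    static = np.hstack([cepstra, power[:, None]])        # 13 static features
    delta = np.diff(static, axis=0, prepend=static[:1])  # 13 differenced features
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])   # 13 double-differenced features
    return np.hstack([static, delta, delta2])            # 39 features per frame

frames = 200
obs = observation_vectors(np.random.randn(frames, 12), np.random.randn(frames))
print(obs.shape)  # (200, 39)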
The acoustic-model-training procedure is identical to the configuration of the official 1997
Hub4 DARPA evaluation.8 We use the same 6,000 senonically tied states as the Hub4cmu model.
Each of these tied states consists of a mixture of 20 Gaussian densities. The senone trees are
automatically constructed from the F0, F1 subsets of the Hub4 1997 training data; that is, we
derive the senone trees from the clean studio speech and clean dialog speech data only.
We train the acoustic model on context-dependent phonemes (triphones). We map these into
6,000 senones and train them with our training segments. Because the Hub4 evaluation uses
separate models for telephone bandwidth speech (F2), we exclude this condition from our Hub4
training data and from our evaluation.
Forced alignment
Forced alignment serves two functions in preparing the data for the estimation of the Gaussian
densities. First, filled pauses are explicitly modeled as phonemes to prevent them from corrupting
the training of actual phonemes. Second, the forced-alignment stage segments the speech data
into state-level segments for the later Baum-Welch model-reestimation stage. The forced
alignment thus assigns each feature vector in the training data segment to a single state in the
HMM.
We force-align the transcript with its speech data to obtain a transcript that reflects the speech
data by inserting noise phones, filler words, or filled pauses between the content words. The
beginning and end of a training utterance are tokenized with a beginning-of-sentence silence <s>
and an end-of-sentence silence </s>. Interword silences are inserted as <sil>. In addition to
silence phones, noise phones such as +INHALE+, +SMACK+, +EH+, +UH+, or +UM+ are
inserted. Figure 3 shows an example of an original transcript and the insertion of nonspeech
sounds in the force-aligned transcript.
original transcript: I THINK THERE ARE
force-aligned transcript: <s> I ++SMACK++ THINK <sil> THERE ARE ++INHALE++ </s>
Figure 3. A transcript before and after forced alignment. The nonspeech sounds <sil> (silence),
++SMACK++ (indicating a lip smack), and ++INHALE++ (indicating a breath noise) are
inserted in the original transcript.
Forced alignment also has a (limited) capability to reject training utterances where the feature
vectors cannot be adequately aligned to a sequence of phonetic states. This would happen, for
example, if the number of phonemes in the training transcript is greater than the number of
phonemes actually contained in the audio feature vector or if the acoustic models completely
mismatch the feature vectors.
Because of this rejection, the amount of training data from the baseline LDC Hub4 data
decreased from 66 hours to 62.8 hours. The automatically extracted CCtrain data decreased from
131.4 hours to 111.5 hours. The data yield from the original 709.8 hours was therefore
approximately 15.6%. All together, the amount of usable data totaled 174.3 hours available for
further training.
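As an illustration of this kind of rejection, the following sketch checks a necessary condition for alignment: with five emitting states per phoneme and at least one 10-millisecond frame consumed per state (assuming a strict left-to-right topology without skip transitions, which is a simplification we make only for the example), a segment whose expanded phoneme sequence requires more frames than the audio provides cannot be force-aligned and is dropped.

def can_force_align(num_phonemes, num_frames, states_per_phone=5):
    # Each state must consume at least one 10-ms frame, so the transcript's
    # expanded state sequence cannot be longer than the audio itself.
    return num_frames >= num_phonemes * states_per_phone

# A 3-second segment (300 frames) with a 70-phoneme transcript would be rejected.
print(can_force_align(num_phonemes=70, num_frames=300))  # False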
Codebook generation
This process takes two steps: gathering the feature vectors together and clustering them into
different densities.
During the gathering step, we aggregate all the feature vectors of all the frames in the training
data. Each feature vector covers a 20-millisecond window with a 10-millisecond overlap and consists of 39 features: 12 mel-scale frequency cepstral coefficients, 12 delta (differenced) mel-scale cepstral coefficients, 12 delta-delta (twice-differenced) mel-scale cepstral coefficients, and three corresponding power features.
The clustering step uses K-means clustering to partition the training vectors into different
Gaussian densities. Each density is represented by its mean and variance in the 39-dimensional
vector spaces.
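The following sketch illustrates the codebook step using scikit-learn's KMeans as a stand-in clustering tool (the toolkit is not part of our system): each cluster of gathered 39-dimensional frame vectors is summarized by the mean and variance of a diagonal Gaussian density.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(frames, n_densities=20):
    # frames: (num_frames, 39) array of gathered observation vectors.
    kmeans = KMeans(n_clusters=n_densities, n_init=10, random_state=0).fit(frames)
    means, variances = [], []
    for k in range(n_densities):
        members = frames[kmeans.labels_ == k]
        means.append(members.mean(axis=0))
        variances.append(members.var(axis=0) + 1e-6)  # floor to avoid zero variance
    return np.array(means), np.array(variances)

frames = np.random.randn(5000, 39)      # placeholder for gathered training frames
means, variances = build_codebook(frames)
print(means.shape, variances.shape)      # (20, 39) (20, 39)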
Model reestimation
As is typical for HMM-based speech-recognition systems, we use the Baum-Welch algorithm
(also known as the forward-backward algorithm) to model the observations in the training data
through the HMM parameters. This algorithm is a kind of EM (expectation maximization)
algorithm that iterates through the data first in a forward pass and then in a backward pass.
During each pass, we adjust a set of probabilities to maximize the probability of a given
observation in the training data corresponding to a given HMM state. Because this estimation
problem has no analytical solution, incremental iterations are necessary until convergence is achieved. In each iteration, the algorithm tries to find better probabilities that maximize the
likelihood of observations and training data. During this phase, we reestimate the mixing weight,
transition probabilities, and mean and variance parameters.
After each Baum-Welch reestimation iteration, we insert a normalization step. We compute the
reestimated model parameters from the reestimation counts obtained through Baum-Welch. This
combined Baum-Welch and normalization iteration repeats until we achieve an acceptable
parameter convergence.
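For readers who want to experiment, the following sketch runs Baum-Welch reestimation with the hmmlearn package as a stand-in trainer. The tiny five-state, two-mixture model and the random data are placeholders only; our actual system trains 6,000 tied states with 20 Gaussian densities each using the Sphinx-III trainer.

import numpy as np
from hmmlearn import hmm

# Two "utterances" of 39-dimensional feature vectors, concatenated as hmmlearn expects.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(120, 39)), rng.normal(size=(90, 39))]
X = np.concatenate(utterances)
lengths = [len(u) for u in utterances]

model = hmm.GMMHMM(n_components=5,       # five HMM states, as in our phone models
                   n_mix=2,              # mixture densities per state (20 in our system)
                   covariance_type="diag",
                   n_iter=10,            # Baum-Welch (EM) iterations
                   random_state=0)
model.fit(X, lengths)                    # forward-backward reestimation of all parameters
print(model.monitor_.converged, model.transmat_.shape)  # convergence flag, (5, 5)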
Speech decoding
We decode speech with a Viterbi search, which finds the state sequence that has the highest
probability of being taken while observing the features of the speech sequence. Because this
experiment’s purpose was to evaluate the new acoustic model’s effectiveness, we perform only a
single search pass.
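The following compact, log-domain Viterbi sketch shows the core of such a search over a generic HMM; a real decoder adds the pronunciation lexicon, language-model scores, and beam pruning, which are omitted here for clarity.

import numpy as np

def viterbi(log_init, log_trans, log_obs):
    # log_init: (S,), log_trans: (S, S), log_obs: (T, S) -> most probable state path.
    T, S = log_obs.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_obs[0]
    for t in range(1, T):
        # For each state, keep only the single best predecessor (max, not sum).
        candidates = score[t - 1][:, None] + log_trans   # (S, S)
        back[t] = candidates.argmax(axis=0)
        score[t] = candidates.max(axis=0) + log_obs[t]
    # Trace back from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 3-state example with 6 observation frames.
rng = np.random.default_rng(1)
path = viterbi(np.log(np.full(3, 1 / 3)),
               np.log(rng.dirichlet(np.ones(3), size=3)),
               np.log(rng.dirichlet(np.ones(3), size=6)))
print(path)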
To get optimal performance from the system, we can use multiple passes where we perform
maximum-likelihood linear regression between passes for speaker adaptation. The Hub4
evaluation system consists of three such decoding passes with acoustic adaptation steps between
each pass. For each pass, we apply an A* beam search for the Viterbi decoding pass and a
directed acyclic graph (DAG) search for the best path pass of the Viterbi word lattice. Finally, we
generate and rescore N-best hypotheses. Our baseline CMU Hub4 system has an overall WER of
24% when all passes are applied.8 Because the test data in our experiments comes from a
different set, and because we do not perform multiple adaptation passes, the Hub4 CMU Sphinx-III system’s baseline WER is 32.82%.
Language model
In addition to the acoustic model, the decoding system uses a language model that reflects the
likelihood of different word sequences. We used a Good-Turing discounted trigram back-off
language model. It is trained on broadcast-news language-model training data and built using a
64,000-word vocabulary.8
By increasing the language model’s contribution relative to the acoustic model, we make the
system less sensitive to acoustic errors and more sensitive to word-transition probabilities. To get
increased benefit from the language model, we can apply an additional N-best rescoring pass. In
general, recognition accuracy is highest with N-best rescoring, followed by the DAG best-path search, and then the single 1-best Viterbi decoding. Multipass decoding might
provide higher accuracy with contributions by a better search through the hypothesis lattice and
with smoothed language models.
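The following minimal sketch shows the standard log-domain combination that the language-model weight controls; the scores are hypothetical, and the exact scoring details of our decoder are not reproduced here.

def combined_score(acoustic_logprob, lm_logprob, lm_weight=6.0):
    # Hypotheses are ranked by acoustic log-likelihood plus the weighted
    # language-model log-probability; a larger weight favors the language model.
    return acoustic_logprob + lm_weight * lm_logprob

# Hypothetical scores for two competing hypotheses of the same utterance.
hyp_a = combined_score(acoustic_logprob=-120.0, lm_logprob=-18.0)  # acoustically better
hyp_b = combined_score(acoustic_logprob=-125.0, lm_logprob=-15.0)  # linguistically better
print("A" if hyp_a > hyp_b else "B")  # with lm_weight=6, hypothesis B wins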
Evaluation results
In the following experiments, we decoded our test speech data with only a single-pass Viterbi
search and without maximum-likelihood linear-regression acoustic adaptation. We did this
because we intended to estimate the improvement achieved on the acoustic model compared to
the baseline model, where single-pass decoding is sufficiently indicative.
Consistent with standard practice in speech evaluations, the recognition word error is not computed simply as the number of misrecognized (substituted) words; it also includes insertions and deletions. The error rate is this error count normalized by the number of words in the test set. So, word error = insertions + deletions + substitutions, and WER = word error / total number of words in the test set.
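The following self-contained sketch computes this measure from a word-level edit-distance alignment; the function name and the example strings are illustrative only.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # substitution or match
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I THINK THERE ARE", "I THINK THEIR"))  # 0.5 (1 sub + 1 del over 4 words)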
The test data for our results came from the 1996 DARPA Hub4 broadcast news development
data (Dev’96 test set).9 It consists of 409 utterances, comprising 16,456 words with a total duration of 1.5 hours. Figure 4 details the Dev’96 test set.
F0 : Baseline broadcast speech
F1 : Spontaneous broadcast speech
F2 : Speech over telephone channels
F3 : Speech in the presence of background music
F4 : Speech under degraded acoustic conditions
F5 : Speech from nonnative speakers
Fx : All other speech
Figure 4. Hub4 acoustic conditions. The 7 conditions (F0, F1, F2, F3, F4, F5, Fx) represent
different kinds of speech in different acoustic environments.
We use different acoustic models suitable for reduced bandwidth speech for the telephone (F2)
conditions. We decided to exclude the F2 condition and only evaluate the nontelephone (F0, F1,
F3, F4, F5) conditions because we are improving a full-bandwidth nontelephone mixture
Gaussian model, not training a narrow-bandwidth telephone mixture Gaussian model.
The different acoustic categories in our Dev’96 test set were not evenly distributed. The
number of words in each category varied widely—out of 16,456 words, the F0 condition
contained 4,437 words; F1, 5,706; F3, 1,809; F4, 2,697; and F5, 1,807.
Figure 5 shows the overall result statistics (including all acoustic conditions together) on the Dev’96 test set. The figure covers both the baseline Hub4cmu system and our improved system (CCtrain, so named because we train with closed-captioned (CC) data), which we built by estimating the acoustic-model parameters from both the 62.8 hours of Hub4 training data and the 111.5 hours of our automatically derived training corpus. We based this corpus on closed-caption data and initial recognition transcripts. For all three language-model weights (3, 6, and 9), our CCtrain acoustic model achieved improvements over our baseline system; the model used a total of 174.3 hours of training data.
Figure 5. Recognition word-error rate for the Hub4cmu-trained acoustic baseline models
compared to the CCtrain models for the Dev’96 test set, with three different language-model
weights.
Figure 6 depicts the absolute percentage decrease in WER, again with different language-model weights, this time 6, 9, and 13.
Figure 6. The absolute difference in word-error rate between the CCtrain system (62.8 hours of
manual transcripts + 111.5 hours of automatically extracted transcripts = 174.3 hours) and the
Hub4cmu baseline system (62.8 hours of manual transcripts only) using three different language-model weights.
Higher language-model weights make it harder to estimate the degree of acoustic-model
improvement. This is due to the language model’s increasing contribution to speech-recognition
accuracy during decoding. Figures 5 and 6 show that adding our CCtrain corpus to the manually
transcribed Hub4 training data improves the overall speech-recognition accuracy.
Figure 7 shows the WER for the individual acoustic conditions, contrasting the baseline
Hub4cmu system with our CCtrain adapted system. The CCtrain model provides superior results
on F0, F3, F4, and F5. For F1, its performance is comparable. The different acoustic conditions
result in widely varying WERs, reflecting the difficulty of the condition for speech recognition as
well as the varying amounts of training data in each condition. For this and all subsequent figures,
we set the language-model weight to six to demonstrate the acoustic component’s effect, with the
language model providing relatively little influence.
Figure 7. CCtrain versus the Hub4cmu baseline recognition word-error rate for each acoustic
condition: F0 = clean broadcast speech, F1 = spontaneous speech, F3 = speech with background
music, F4 = speech under degraded acoustic conditions, and F5 = speech from nonnative
speakers.
To illustrate the transcription-extraction approach’s power, we built another acoustic model
exclusively from the extracted training data (CCtrain-only), without any of the manually
transcribed Hub4 training data. To emphasize the difference between human-transcribed data and
automatically extracted data, we also limited the training to 70.7 hours of transcribed data. Thus,
the CCtrain-only automatically extracted data condition is comparable to the Hub4-only
condition. Figure 8 shows that the CCtrain-only condition performs slightly better than the Hub4-only condition. We also found that an expanded CCtrain model (CCtrain + Hub4) outperforms
our baseline Hub4cmu system. In the expanded CCtrain condition we are merely increasing the
amount of training data to 174.3 hours by adding automatically extracted data to the manually
transcribed Hub4 training data without any other adjustments.
Figure 8. Word-error rate of CCtrain built from 174.3 hours of data consisting of our extracted
segments plus Hub4 data, Hub4cmu built from 62.8 hours of Hub4 data, and CCtrain built from
70.7 hours of extracted segments only. The word-error rates are shown for each
acoustic condition: F0 = clean broadcast speech, F1 = spontaneous speech, F3 = speech with
background music, F4 = speech under degraded acoustic conditions, and F5 = speech from
nonnative speakers.
Error analysis
To analyze the effect of the additional training data in CCtrain on the recognition performance,
we examined the word errors in the test set with respect to the number of training instances in the
Hub4 and the CCtrain training data.
In Figure 9, we only examined words that were present in both the training corpus and the Dev’96 test data. The x-axis represents the number of times a word was present in the training
set, and the y-axis is the number of errors for that same word in the Dev’96 test set. Each word is
represented by two points with a line connecting them. The first point represents the number of
training instances for the word in the Hub4 training corpus and the number of errors for this word
during the Hub4cmu baseline recognition. The second point is based on the CCtrain corpus. The
CCtrain point is determined by the number of training instances in the CCtrain corpus on the x-axis and the number of errors for that word during the CCtrain decode on the y-axis. Because the
CCtrain training corpus included all the Hub4 words as well as new training material, the CCtrain
point is always on the right end of a line, while the Hub4cmu point is on the left end. We divided
the data into two sets of words, those words where the errors decreased with more training data
and those where the errors increased after adding the additional CCtrain data. The descending (gray) lines show the words recognized with fewer errors by the CCtrain model, while the ascending (bold) lines show the words recognized less accurately after CCtrain.
Figure 9. The effect of word frequency in the different training corpora on the number of errors.
Each word in the test data is plotted based on its errors in the Hub4cmu baseline recognition
given its frequency during Hub4 training (the left point of each line) compared to the CCtrain
frequency and recognition errors for that word (the right point of each line). An ascending line
shows an increase in the error rate for a word (17.4% of the Dev’96 test-set words), while a
descending line reflects a decrease in error rate by the CCtrain model (82.6% of the Dev’96 test-set words).
Figure 10 plots the same data. Instead of plotting both the Hub4cmu baseline word-training
frequency and total errors together with the CCtrain training frequency and errors, we subtract the
CCtrain word errors from the Hub4cmu errors, and we use the difference in training frequencies
between CCtrain and Hub4cmu on the x-axis. Each data point in Figure 10 now represents the
increase in word training frequency for the CCtrain corpus on the x-axis and the difference in
word errors on the y-axis between the two conditions. If the word error does not change, the
difference in errors is 0. If the Hub4cmu word errors were lower than the CCtrain error rate, the
data values in Figure 10 will lie above the 0 point on the y-axis. If the CCtrain showed an
improvement for that word, the point will be plotted below 0 on the y-axis. Examining Figure 10
shows that the word errors usually decrease when we increase the number of training instances by
2,000 or more.
Figure 10. A plot of the difference in errors for each word between CCtrain and Hub4cmu model
recognition in the Dev’96 test data (y-axis). The x-axis shows the increase in training-word
frequency for CCtrain over Hub4cmu for a given word. Increases in word error for CCtrain are
positive on the y-axis (17.4% of all words), while decreases are negative (82.6%).
Figure 11 shows all the words in the test data, plotted as a function of their frequency in the
training data. To show the trend of the individual points, a line represents the average WER in
histogram bins, where each bin combines 30 adjacent frequencies.
Figure 11. The word-error rate in the CCtrain condition for each word in the Dev’96 test set,
plotted as a function of the number of training instances for that word. The trend line shows the
average error rate for words in a histogram bin of 30 similarly frequent words. Beyond about 2,000 training instances, additional instances no longer help reduce the word-error rate; up to that point, each additional training instance of a word helps reduce its error rate.
Because we are training HMMs based on triphones, not words, we also analyzed the behavior
of the triphones with respect to the number of training instances observed. As we described
earlier, a triphone is a context-dependent phoneme where its left and right phoneme context is
taken into account. A context-independent phoneme does not carry its context information. For
example, the phoneme IH in the context TH IH NG (as in the word something) is a triphone IH where TH is its left phoneme context and NG is its right phoneme context.
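The following small sketch expands a phoneme sequence into such context-dependent triphones; treating the utterance edges as a silence context is an assumption made only for the example.

def triphones(phonemes, boundary="SIL"):
    # Pad the sequence with a boundary context and emit (left, center, right) triples.
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]

# "something" (rough phoneme sequence): S AH M TH IH NG
for left, center, right in triphones(["S", "AH", "M", "TH", "IH", "NG"]):
    print(f"{center}({left},{right})")
# ... IH(TH,NG) appears as the triphone from the example above.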
If the corpora used to train the acoustic model contain little or no noise, we observe words or triphones in the training set more often, and we would therefore expect better acoustic models.
Predicting informative triphone subsets
The analysis of our test data indicated that training on about 1,000 instances of a triphone or
2,000 instances of each word would be sufficient to obtain acceptable recognition performance.
To verify this hypothesis, we conducted an experiment with selective training of rare or
underrepresented triphones or words, by excluding redundant training data that had no instances
of triphones below the threshold.
We hypothesized a threshold, set to 1,000, on the triphone frequency where the triphone error
rate started to closely approximate its asymptote. We extracted a subset of our automatically transcribed speech-training data by adding a new segment to the subset only if the
frequency of any triphone within the segment was below our hypothesized threshold. We trained
an acoustic model on this training subset to see whether the hypothesized threshold near the
asymptote on our plot is a good indicator of the number of triphone observations required.
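The following sketch captures our reading of this selection pass: segments are scanned in order and added to the subset only while they still contribute at least one triphone whose running count is below the threshold.

from collections import Counter

def select_segments(segments, threshold=1000):
    # segments: list of lists of triphones (one list per training segment).
    counts = Counter()
    selected = []
    for segment in segments:
        # Keep the segment only if it still adds an underrepresented triphone.
        if any(counts[t] < threshold for t in segment):
            selected.append(segment)
            counts.update(segment)
    return selected

# Toy example with a threshold of 2: the third segment's only triphone
# already meets the threshold, so it is skipped.
toy = [["a"], ["a"], ["a"], ["b"]]
print(len(select_segments(toy, threshold=2)))  # 3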
Figure 12 shows the results of single-pass decodings. Our selected training subset of 66.7
hours, which included only approximately 60% of the original training set of 111.5 hours, yielded
a relative WER decrease of 6.76% (2.22% absolute) compared to the Sphinx baseline system built from 62.8
hours of Hub4’97 training data. However, our selection technique resulted in a WER increase of
0.25% compared to the complete 111.5 hours of the CCtrain data set.
Figure 12. Word-error rate comparison of models built from three nonselected sets (62.8 hours
Hub4 baseline, 111.5 hours CCtrain, and 174.3 hours combined) and one selected set (66.7 hours
selected from 111.5 hours of CCtrain data).
If we take only the 111.5 hours of unselected CCtrain data, we obtain a WER that is 0.8% lower than when this CCtrain data is combined with the Hub4’97 manually transcribed data, yielding a total of 174 hours. However, one would expect more training data to result in a better WER. We
believe this lower WER on less training data is because our automatic extraction technique
produces more accurate transcriptions than those found in the manual Hub4’97 training data. It
might also appear contradictory to first extract an arbitrarily large amount of transcribed speech
and then to reduce that set into a smaller subset, but selecting a subset from the original, larger
amount of speech increases the probability of incorporating more diverse triphones.
Although the training utterances derived from TV broadcasts, rough speech-recognition
transcripts, and closed captions are not completely error-free, a potentially unlimited amount of
this data is freely available from TV broadcasts 24 hours a day. The automated procedures for
deriving accurate transcriptions allow virtually unlimited amounts of data to be processed, limited
only by CPU cycles and storage space. The large quantity of data compensated for the remaining errors in the training-data transcripts.
Selection from collected speech data based on triphone frequency thresholding reduces training
costs, results in better decoding performance than when a similar amount of unselected manually
transcribed speech is used, and performs comparable to much larger amounts of speech training
sets.
One possible criticism of our scheme for learning acoustic-model parameters is that the
approach can only identify sections of speech on which the recognizer already performs well.
However, a dynamically constructed language model based on the captioned words, together with
a high language-model preference in the recognizer’s first pass, can diminish this bias. The
identified sections of speech have sufficient variability to provide useful training data, but
eventually an asymptotic plateau might be reached.
We are also working on improving the quality of training corpora with confidence annotation.
Extracting segments based on confidence, as Zavaliagkos and Colthurst described,1 would be an alternative to selecting the conjunction of closed captions and the speech recognizer’s output.
Acknowledgments
This article is based on work supported by the National Science Foundation under Cooperative
Agreement IIS-9817496. Any opinions, findings, conclusions, or recommendations expressed in this
material are those of the authors and do not necessarily reflect the views of the NSF.
References
1. G. Zavaliagkos and T. Colthurst, “Utilizing Untranscribed Training Data to Improve Performance,” Proc. DARPA Speech Recognition Workshop, 1998, pp. 301–305.
2. M. Witbrock and A.G. Hauptmann, Improving Acoustic Models by Watching Television, Tech. Report CMU-CS-98-110, Computer Science Dept., Carnegie Mellon Univ., Pittsburgh, Pa., 1998.
3. K.F. Lee, H.-W. Hon, and R. Reddy, “An Overview of the Sphinx Speech Recognition System,” Readings in Speech Recognition, Morgan Kaufmann, San Francisco, 1990.
4. A. Rudnicky, “Language Modeling with Limited Domain Data,” Proc. ARPA Workshop on Spoken Language Technology, 1996.
5. H. Bourlard, H. Hermansky, and N. Morgan, “Copernicus and the ASR Challenge—Waiting for Kepler,” Proc. DARPA Speech Recognition Workshop, 1996.
6. M. Hwang et al., “Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in Sphinx-II,” Proc. Int’l Conf. Acoustics, Speech and Signal Processing, Vol. I, 1994, pp. 549–552.
7. H. Ney, “The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition,” IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 2, Apr. 1984, pp. 263–271.
8. K. Seymore et al., “The 1997 CMU Sphinx-3 English Broadcast News Transcription System,” Proc. DARPA Speech Recognition Workshop, 1998.
9. D. Pallett, “Overview of the 1997 DARPA Speech Recognition Workshop,” Proc. DARPA Speech Recognition Workshop, 1997.
The Informedia Digital Video Library
Researchers working on the Informedia Digital Video Library Project at Carnegie Mellon
University are creating a digital library of text, image, video, and audio data, whose entire content
can be searched for rapid retrieval of material relevant to a user query.1 Through the integration
of technologies from natural language understanding, image processing, speech recognition, and
information retrieval, the Informedia Digital Video Library system lets users explore the
multimedia data both in depth and in breadth. Speech recognition is a critical component in the
automated library-creation process.2,3 Users can employ automatic speech recognition to
transcribe the audio portion of the video data stored in MPEG format in the library. This speech-recognizer-generated transcript forms the basis for the main text search and retrieval functions.
The transcripts also let the natural-language component proceed with functions such as
summarizing the library documents, classifying them into topics, translating documents, and
creating descriptive titles.
References
1. M. Christel et al., “Techniques for the Creation and Exploration of Digital Video Libraries,” Multimedia Tools and Applications, Vol. 2, Borko Furht, ed., Kluwer Academic Press, Boston, 1996.
2. A.G. Hauptmann and H.D. Wactlar, “Indexing and Search of Multimodal Information,” Proc. Int’l Conf. Acoustics, Speech and Signal Processing, 1997.
3. M.J. Witbrock and A.G. Hauptmann, “Artificial Intelligence Techniques in a Digital Video Library,” J. Amer. Soc. for Information Science, 1997.