Proceedings of IWSLT 2008, Hawaii - U.S.A.
Simultaneous German-English Lecture Translation
Muntsin Kolss1, Matthias Wölfel1, Florian Kraft1, Jan Niehues1, Matthias Paulik1,2, and Alex Waibel1,2
1 Fakultät für Informatik, Universität Karlsruhe (TH), Germany
2 Language Technologies Institute, Carnegie Mellon University, USA
{kolss, wolfel, fkraft, jniehues, paulik, waibel}@ira.uka.de
Abstract
In an increasingly globalized world, situations in which
people of different native tongues have to communicate with
each other become more and more frequent. In many such
situations, human interpreters are prohibitively expensive or
simply not available. Automatic spoken language translation
(SLT), as a cost-effective solution to this dilemma, has received increased attention in recent years. For a broad range of applications, including live SLT of lectures and oral
presentations, these automatic systems should ideally operate in real time and with low latency. Large and highly specialized vocabularies as well as strong variations in speaking
style – ranging from read speech to free presentations suffering from spontaneous events – make simultaneous SLT of
lectures a challenging task.
This paper presents our progress in building a simultaneous German-English lecture translation system. We emphasize some of the challenges which are particular to this
language pair and propose solutions to tackle some of the
problems encountered.
1. Introduction
At educational institutions around the world, most lectures
are given in the local language. Lectures and oral presentations form a core part of knowledge dissemination and
communication, and as the number of exchange programs
grows, this puts foreign students and visitors at a disadvantage and severely reduces opportunities for collaboration and
exchange of ideas. Likewise, lectures and presentations at
foreign organizations do not take place or are awkward because speaker and audience do not feel comfortable enough
outside their native tongue. While human translation services
are not an option for overcoming the language barrier, as the costs
would be prohibitive, effective automatic translation systems
would go a long way towards making lectures and other presentations with live interaction more accessible.
Building an automatic system for simultaneous translation of lectures poses many challenges. First, lectures have
a wide variety of topics that cover a virtually unlimited domain, and often go into much more detail than spoken language encountered in other speech translation tasks, such as
limited-domain dialogs for travel assistance or translation of
parliamentary speeches. The vocabulary and expressions can
become very specialized with precise meanings that differ
from the general usage. Since many lecturers are not professional speakers, the speaking styles can vary considerably
and are much more conversational and informal than in prepared speeches or short utterances. The language is often
ungrammatical and contains many disfluencies such as repetitions and false starts, and while the spoken input is usually
a more or less constant flow of words, it does not necessarily
consist of separable sentences with boundaries.
Obviously, such a system must run in real time since the
speaker will not wait for the system to catch up. In addition,
the system should translate with low latency, since a long delay between the original utterance, during which the speaker
may navigate with a light-pointer over a slide, and the output
of the translation will make it more difficult for the listener
to follow the lecture, and could severely impact the understanding of the presented material.
In [1], a translation system for speeches and lectures
was presented which translated English lectures into Spanish. Translating German lectures into English, as presented
in this work, introduces additional difficulties not encountered in [1].
Speakers of non-English languages tend to embed more
English words in their speech, especially for technical terms
which are accepted in many disciplines. If unhandled, these
English words will be misrecognized and cause additional
follow-up errors. Similarly, the large number of compound
words in the German language must be dealt with in both the
speech recognizer, since they form an unlimited vocabulary,
and the machine translation component, where alignment is
based on word-to-word correspondences. The significant difference in word order between German and English poses
another challenge for machine translation, one which is exacerbated by the real-time processing and latency requirements.
In this paper, we present our progress in building a system for simultaneous translation of lectures from German to
English. Section 2 gives an overview of the architecture and
the components of our current system. In Sections 3 and
4, we explain in detail the speech recognition and machine
translation components, respectively, and show individual results obtained during the development. Section 5 presents an
end-to-end evaluation of the system and highlights a number
of problems, and Section 6 concludes the paper.
2. System overview
This section briefly describes the lecture speech-to-speech translation system, as summarized in Figure 1.
The input modality is limited to a single microphone. Additional input modalities could include additional microphones or slide information. The slide information can, for example, be used to adapt the language model weights on a slide-by-slide basis [2].
The speech-to-speech translation system can be decomposed into four main parts:
1. automatic speech recognition
2. segmentation
3. machine translation
4. speech synthesis
A detailed explanation of the different components follows in later sections. For synthesis we have used an English male voice provided by Cepstral [3].
The output modalities comprise the text transcript in German, the written translation in English, and a synthesized rendering of the English text.

[Figure 1: System overview of the individual components: German speech is processed by the ASR front-end and decoder, segmented, translated from German text into English text by the MT front-end and decoder, and finally synthesized as English speech.]

3. Speech recognition
In order to evaluate the speech recognition performance under realistic conditions we have recorded and transcribed 17 hours of German lecture speech (continuous, freely spoken) by native German speakers with different microphone types and speaker-to-microphone distances.
As a speech recognition engine we used the Janus Recognition Toolkit (JRTk). The speech recognition system configuration is described and individually evaluated in the next sections.

3.1. Front-end
Since it has been shown that Mel frequency cepstral coefficients (MFCCs) or perceptual linear prediction (PLP) coefficients are outperformed by warped MVDR cepstral coefficients [4] in noisy conditions, we decided to use the warped MVDR front-end for our experiments. The warped MVDR spectral envelopes, with a frame size of 16 ms and a shift of 10 ms, were directly converted into a truncated cepstral sequence. The resulting 20 mean-normalized cepstral coefficients were stacked (seven adjacent left and right frames) and truncated to the final feature vector dimension of 42 by multiplication with the optimal feature space matrix (the linear discriminant analysis matrix multiplied with the global semi-tied covariance transformation matrix [5]).

3.2. Acoustic model
The acoustic training material used for the experiments reported here consisted of approximately 17 hours of in-house
recordings of lectures from various speakers, including the
test speaker, resulting in 2,000 context-dependent codebooks with up to 64 Gaussians with diagonal covariances each. The final acoustic models were discriminatively trained using maximum mutual information estimation, where the speaker-dependent adaptation matrices were kept unchanged during training.
3.3. Vocabulary selection and compound splitting
The German language contains a large number of compound words and morphological forms which cause additional difficulties for automatic transcription. Compounds are composed of at least two nouns, so an innumerable number of possible words can be formed. The number of inflected forms in German can also be several times that of English. Thus the vocabulary size grows much faster for German than for English, causing
• poor vocabulary coverage (if the vocabulary is truncated),
• an underestimation of language model weights, and
• a significant increase in search space complexity.
To reduce the number of words in the vocabulary of the recognition system and, at the same time, the number of out-of-vocabulary (OOV) words, we split compound words into their constituents by comparison with a reference vocabulary. The reference vocabulary was created with Hunspell [6] from a base vocabulary tagged with word types and a corresponding affix list.
The final vocabulary set has been selected from small in-domain corpora (lecture transcriptions, presentation slides,
web data) and large out-of-domain corpora (broadcast news).
On the latter, the select-vocab tool from the SRILM toolkit
[7], which estimates weighted and combined counts for all
words, has been used to extract a ranked list of vocabulary
entries. We reduced the resulting vocabulary size from approximately 255,000 entries to a manageable size of 65,000
words, to find a compromise between coverage and search
space complexity. The reduced vocabulary size increased the
OOV rate from 1.6% to 3.1%. We observed that this OOV
rate is mainly due to special English expressions, compound
words and inflected forms not seen in the training data. After merging the 65,000-word vocabulary with the vocabulary
generated from the small in-domain corpora we were able to
reduce the OOV rate to 2.3%. By splitting the compound
words the OOV rate dropped to 1.5% (case sensitive) or 1.2%
(case insensitive).
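The following sketch illustrates, under simplifying assumptions, the kind of count-based ranking such a selection performs: words are scored by a weighted combination of their relative frequencies in the in-domain and out-of-domain data, and the top entries are kept. The mixing weight and function names are illustrative, not the SRILM implementation.

# Sketch of count-based vocabulary selection in the spirit of the procedure
# above: rank words by a weighted combination of their relative frequencies in
# a small in-domain corpus and a large out-of-domain corpus, then keep the
# top-N entries. Weight and names are illustrative assumptions.
from collections import Counter

def rank_vocabulary(in_domain_tokens, out_domain_tokens, in_weight=0.7, top_n=65000):
    in_counts = Counter(in_domain_tokens)
    out_counts = Counter(out_domain_tokens)
    in_total = sum(in_counts.values()) or 1
    out_total = sum(out_counts.values()) or 1

    def score(word):
        p_in = in_counts[word] / in_total
        p_out = out_counts[word] / out_total
        return in_weight * p_in + (1.0 - in_weight) * p_out

    candidates = set(in_counts) | set(out_counts)
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:top_n]

# vocab = rank_vocabulary(lecture_tokens, broadcast_news_tokens)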
3.4. Language model
To train a 4-gram language model, we used the same small in-domain corpora and large out-of-domain corpora as were used for vocabulary selection. For the web collection we used the scripts provided by the University of Washington [8], with small modifications to compensate for encoding issues arising with German text. The search queries were created from the most frequent bi- and trigrams in the in-domain corpora. A bi- or trigram was kept if it contained no stopword and at least one of its words was capitalized. We found that the casing restriction leads to better keywords, since German nouns are written with an initial capital letter. The queries, a combination of 2004 bigrams and 768 trigrams, retrieved approximately 10,000 HTML pages and 8,000 PDF files, which resulted in 41 million words after appropriate filtering. The web collection reduced the language model perplexity from 280 to 246, which significantly improved the word accuracy by more than 5.0% relative.
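The query construction can be sketched as follows; the stopword list, frequency threshold, and function name are illustrative assumptions.

# Sketch of the query selection described above: take frequent bi- and trigrams
# from in-domain text, discard those containing a stopword, and keep only
# n-grams with at least one capitalized word (German nouns are capitalized).
from collections import Counter

GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "eine", "zu", "in"}

def candidate_queries(sentences, n_values=(2, 3), min_count=5):
    ngram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                ngram_counts[tuple(tokens[i:i + n])] += 1

    queries = []
    for ngram, count in ngram_counts.most_common():
        if count < min_count:
            break
        if any(tok.lower() in GERMAN_STOPWORDS for tok in ngram):
            continue  # no stopwords allowed
        if not any(tok[:1].isupper() for tok in ngram):
            continue  # require at least one capitalized (noun-like) word
        queries.append(" ".join(ngram))
    return queries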
3.5. English words in German lectures
A general problem in automatic speech recognition is the
transcription of foreign words. Our analysis of lectures in
German given at Universität Karlsruhe (TH) showed
that only 64% or 4195 out of 6589 words could be classified as uniquely German, i.e., words found only in the German Hunspell vocabulary. English words are represented by
only 2% or 110 words in the transcripts, while 21% or 1397
words were represented in both languages. The remaining
887 words, most of them fillers, were found in neither the
German nor the English Hunspell vocabulary and thus were
counted as unknown.
An analysis of the recognition errors, see Table 1, shows
that the English words cause a significant degradation in
recognition performance. The overall word error rate of the evaluated system is 13.8%, but the German words have a word error rate of only 10.7% while the English words have a word error rate of 60.0%.

Language          German   English     Both   Unknown
Total Words         4195       110     1397       887
Deletions             52         1       44         0
Insertions            58         9       37         2
Substitutions
  German             258        37       91       113
  English              7         6        8         7
  Both                68        10       33        56
  Unknown              5         3        2         4
Total Error          448        66      215       182
WER                10.7%     60.0%    15.4%     20.5%

Table 1: Absolute errors and word error rate (WER) by language for the baseline system. Results from [9].
In order to reduce the performance gap between the
German words in a German speech recognition system and
the English words in a German speech recognition system,
one could employ both the German as well as the English
phoneme set (parallel) or map the English pronunciation dictionary to German phonemes (mapping). Both methods have
been successfully applied as described in [9], and the best
single method, mapping, leads to an absolute word error reduction by more than one percent.
3.6. ASR results
We evaluated on several different speaker-to-microphone distances, using unadapted (first pass) acoustic models, as well
as acoustic models (second pass) after unsupervised adaptation with maximum likelihood linear regression (MLLR),
constrained MLLR, and vocal tract length normalization.
We observe in Table 2 that the recognition accuracy suffers significantly when the microphone is moved away from the speaker's mouth. The warped MVDR front-end leads to better results than the MFCC front-end, in particular in severe acoustic environments. Different speech feature enhancement techniques to reduce the performance gap between distant and close recordings are investigated in [10].
We also compare the runtime performance of the system for the different microphones, which correspond to different speaker-to-microphone distances. The runtime factors have been measured on an Intel Xeon processor with 3.2 GHz and 3.5 GByte of RAM. We have observed that, due to the more difficult acoustic environment, decoding in the ASR system takes longer for channels that are farther away from
the speaker’s mouth. While no degradation in word accuracy
can be observed between the CTM and the lapel microphone,
the real time factor (RTF) increases by 25%. This increase is
more severe for the table top and wall mounted microphone,
125% and 290% respectively.
Microphone        CTM            Lapel          Table Top      Wall
Distance          5 cm           20 cm          150 cm         350 cm
SNR               26 dB          23 dB          17 dB          12 dB
Pass              1      2       1      2       1      2       1      2
Front-End                        Word Error Rate %
power spectrum    13.0   13.1    13.5   13.4    26.6   20.6    40.5   26.5
warped MVDR       13.4   13.4    13.1   13.1    26.8   19.7    39.5   24.7

Table 2: Word error rates for different speaker-to-microphone distances and front-ends.
4. Machine translation
In its baseline configuration, the machine translation component in our online lecture translator is a phrase-based statistical machine translation system which uses a log-linear combination of several models: an n-gram target language model,
phrase translation models for both directions, p(f|e) and p(e|f), a distance-based distortion model, a word penalty
and a phrase penalty. The model scaling factors for these features are optimized on the development set by minimum error rate training (MERT) [11]. Search is performed using our
in-house STTK-based beam search decoder which allows restricted word re-ordering within a local window during translation.
4.1. Word alignment
In statistical machine translation (SMT), parallel corpora are
often the most important knowledge source. These corpora
are usually aligned at the sentence level, so a word alignment
is still needed. For a given source sentence f_1^J and a given target sentence e_1^I, a set of links (j, i) has to be found which describes which source word f_j is translated into which target word e_i.
Most SMT systems use the freely available GIZA++ toolkit [12] to generate the word alignment. This toolkit implements the IBM and HMM models introduced in [13, 14].
They have the advantage of unsupervised training and are
well suited for a noisy-channel approach, but introducing additional features into these models is difficult.
In contrast, we use the discriminative word alignment
model presented in [15], which uses a conditional random
field (CRF) to model the alignment matrix. This model is
symmetric and no restrictions to the alignment are required.
Furthermore, it is easy to use additional knowledge sources
and the alignment can be biased towards precision or recall.
In this framework, the different aspects of the word alignment are modeled by three groups of features. The first group of features depends only on the source and target words and may therefore be called local features. The lexical features, which represent the lexical translation probability of the words as well as the source- and target-normalized translation probabilities of the words, belong to this group. In addition, the following local features are used: the relative distance of the sentence positions of both words. The relative
edit distance between source and target word is used to improve the alignment of cognates, and a feature indicating if
source and target words are identical is calculated. Lastly,
the links of the IBM4-alignments are used as an additional
local feature. In the experiments this leads to 22 features.
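As an illustration, a few of these local features for a candidate link (j, i) could be computed as in the following sketch; the edit-distance approximation via difflib and the stubbed lexicon lookup are simplifications, not the system's feature functions.

# Sketch of a few "local" features for a candidate alignment link (j, i):
# relative sentence-position distance, relative edit distance (helps cognates),
# and word identity. The lexical probability is stubbed as a placeholder lookup.
from difflib import SequenceMatcher

def local_features(src_words, tgt_words, j, i, lexicon=None):
    f, e = src_words[j], tgt_words[i]
    feats = {
        # relative distance of the sentence positions of both words
        "rel_pos_dist": abs(j / len(src_words) - i / len(tgt_words)),
        # relative edit distance, approximated here via difflib similarity
        "rel_edit_dist": 1.0 - SequenceMatcher(None, f.lower(), e.lower()).ratio(),
        # indicator: source and target word are identical
        "identical": float(f.lower() == e.lower()),
    }
    if lexicon is not None:  # e.g. a lexical translation table, if available
        feats["lex_prob"] = lexicon.get((f, e), 1e-7)
    return feats

print(local_features(["das", "Haus"], ["the", "house"], 1, 1))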
The next group of features are the fertility features that
model the probability that a word translates into one, two,
three or more words, or does not have any translation at all.
In this group there are indicator features for the different fertilities and a real-valued feature, which can use the GIZA++ probabilities. To reduce the complexity of the calculation, this is only done up to a given maximal fertility N_f, and there is an additional indicator feature for all fertilities larger than N_f. In total, 12 fertility features were used in the experiments.
The first-order features model the first-order dependencies between the different links. They are grouped into different directions. For example, the most common direction
is (1, 1), which describes the situation that if the words at
positions j and i are aligned, also the immediate successor
words in both sentences are aligned. In our configuration the
directions (1, 1), (2, 1), (1, 2), (1, −1), (1, 0) and (0, 1) are
used. For every direction, there is an indicator feature indicating that both links are active.
Since the structure of the described CRF is quite complex, the inference cannot be done exactly. Instead, belief
propagation was used to find the most probable alignment.
The weights of the CRFs are trained using gradient descent for a fixed number of iterations. The CRFs were first trained using the maximum log-likelihood criterion of the correct alignment. Since the hand-aligned data is annotated with sure and possible links, the CRFs are afterwards trained towards an approximation of the alignment error rate (AER).
4.2. Model training
The bulk of training data for our translation system comes
from the parallel German-English European Parliament Plenary Speeches (EPPS) and News Commentary corpora, as
provided by WMT08 [16]. To this we added a smaller corpus of about 660K running words consisting of spoken language expressions in the travel domain, and a corpus of about
100K running words of German lectures held at Universität
Karlsruhe (TH) which were transcribed and translated into
English, yielding a parallel training corpus of about 1.46M
sentence pairs and 36M running words.
For machine translation experiments, we applied compound splitting to the German side of the corpus, using the
frequency-based method described in [17] which was trained
on the corpus itself. Even though this method splits more
aggressively and generates shorter and partially erroneous
word fragments, it produced better translation quality than
the method used for the speech recognizer described in Section 3.3. After tokenization and compound splitting, we performed word alignment, using the GIZA++ toolkit and, for
the final system, the method described in the previous section, and extracted bilingual phrase pairs with the Pharaoh
toolkit [18]. 4-gram language models were used in all experiments, estimated with modified Kneser-Ney smoothing as
implemented in the SRILM toolkit [7].
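For illustration, the core idea of the frequency-based splitting in [17] can be sketched as follows, restricted to a single binary split and without filler letters; the names and counts are toy assumptions.

# Simplified sketch of frequency-based compound splitting in the spirit of [17]:
# among all candidate splits (including "no split"), pick the one whose parts
# have the highest geometric mean of corpus frequencies. Restricted to a single
# binary split; the full method also handles filler letters and recursion.
from math import prod

def split_by_frequency(word, freq, min_len=3):
    best_parts, best_score = [word], float(freq.get(word, 0))
    for i in range(min_len, len(word) - min_len + 1):
        parts = [word[:i], word[i:]]
        counts = [freq.get(p, 0) for p in parts]
        score = prod(counts) ** (1.0 / len(parts))  # geometric mean
        if score > best_score:
            best_parts, best_score = parts, score
    return best_parts

freq = {"auto": 500, "bahn": 400, "autobahn": 10}
print(split_by_frequency("autobahn", freq))  # ['auto', 'bahn']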
4.3. Model adaptation
The performance of data-driven MT systems depends heavily
on a good match between training and test data in terms of
domain coverage and speaking style. This applies even more
so in our case, as university lectures can go deeply into a very
narrow topic and, depending on the lecturer, are often given
in a much more informal and conversational style than that
found in prepared speeches or even written text.
To adapt the MT component towards this lecture style,
we trained two separate sets of translation and language models: one set which was trained on the complete parallel training corpus for broad coverage, and additional smaller models
trained only on the collected lecture data described in Section 4.2.
For decoding, we combine the two generated phrase tables into a single one, but retain disjunct model features for
each part such that the corresponding feature scaling weights
can be optimized separately during MERT. This approach
improves significantly over the baseline of using a single
phrase table trained on the complete corpus.
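The combination can be pictured as in the following sketch: each phrase pair carries two disjunct feature blocks, one from the general table and one from the in-domain table, with a placeholder score when a pair is missing from one of them. The data layout and default value are illustrative assumptions, not the actual file format.

# Sketch of the phrase-table combination described above: entries from the
# broad-coverage table and the in-domain table are merged into one table, but
# each entry keeps two disjunct feature blocks so MERT can weight the general
# and the in-domain scores separately.
DEFAULT = 1e-7  # placeholder score for "phrase pair unseen in this table"

def combine_phrase_tables(general, in_domain):
    """general / in_domain: dict mapping (src, tgt) -> list of feature scores."""
    combined = {}
    for pair in set(general) | set(in_domain):
        gen_feats = general.get(pair, [DEFAULT] * 4)
        ind_feats = in_domain.get(pair, [DEFAULT] * 4)
        combined[pair] = gen_feats + ind_feats  # disjunct feature blocks
    return combined

general = {("haus", "house"): [0.6, 0.5, 0.4, 0.3]}
in_domain = {("haus", "house"): [0.8, 0.7, 0.6, 0.5],
             ("rechner", "computer"): [0.9, 0.8, 0.7, 0.6]}
print(combine_phrase_tables(general, in_domain))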
Likewise, our decoder uses the two language models with
separate model scaling factors. For our system, this approach
gives the same performance improvement as interpolating
the language models based on perplexity on the development
set. The improvements from model adaptation are shown in
Table 3.
4.4. Front-end
In order to get the best translation results, the output of the
speech recognizer has to be pre-processed to match the format expected by the translation component. We first perform
compound splitting as described in Section 4.2, on top of the
compound splitting already done in the speech recognizer.
Because the word order in German and English is very
different, reordering over a rather limited distance, as done in many phrase-based systems, does not lead to good translation quality. We experimented with rule-based reordering
as proposed in [19], in which a word lattice is created as a
pre-processing step to encode different reorderings and allow somewhat longer distance reordering. The rules to create
the lattice are automatically learned from the corpus and the
part-of-speech (POS) tags created by the TreeTagger [20]. In
the training, POS based reordering patterns were extracted
from word alignment information. The context in which a
reordering pattern is seen was used as an additional feature.
At decoding time, we build a lattice structure for each source
utterance as input for our decoder, which contains reordering
alternatives consistent with the previously extracted rules to
avoid hard decisions.
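As an illustration of how one learned rule could contribute a reordering alternative, the following sketch permutes the words wherever a POS pattern matches; the example rule, tags, and permutation are invented for illustration and do not reproduce the learned rule set.

# Illustrative sketch of applying one POS-based reordering rule: wherever the
# source POS pattern matches, an alternative word order is generated, which
# could then be added as a second path in the lattice (no hard decision).
def apply_rule(words, pos, pattern, permutation):
    """Return reordered word sequences for every position where `pattern` matches."""
    n = len(pattern)
    alternatives = []
    for start in range(len(pos) - n + 1):
        if pos[start:start + n] == pattern:
            reordered = list(words)
            reordered[start:start + n] = [words[start + k] for k in permutation]
            alternatives.append(reordered)
    return alternatives

words = ["diesen", "Vortrag", "zu", "geben"]
pos   = ["PDAT", "NN", "PTKZU", "VVINF"]
# toy rule: move the infinitive group in front of its object
print(apply_rule(words, pos, pattern=["PDAT", "NN", "PTKZU", "VVINF"],
                 permutation=[2, 3, 0, 1]))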
                                      dev     test
Baseline                            31.54    27.18
Language model (LM) adaptation      33.11    29.17
Translation model (TM) adaptation   33.09    30.46
LM and TM adaptation                34.00    30.94
+ Rule-based word reordering        34.59    31.38
+ Discriminative word alignment     35.24    31.40

Table 3: Translation performance in %BLEU on manual transcripts for model adaptation, rule-based reordering, and discriminative word alignment.
4.5. Input segmentation
Machine translation systems need a certain minimum context
and perform best when given more or less well-formed sentence or utterance units to translate. The standard approach to
coupling speech recognition and machine translation components into an integrated speech translation system is segmenting the unstructured first-best ASR hypothesis into shorter,
sentence-like units prior to passing them to the subsequent
machine translation component.
Choosing good segment boundaries, ideally corresponding to semantic or strong syntactic boundaries, can have a
big impact on the final translation performance [21, 22, 23].
Some of the most useful cues for segmentation are a source language model, pause information from the original audio signal, since pauses often correspond to punctuation or semantic boundaries, and reordering and phrase boundaries observed within the translation system.
For simultaneous translation, we have the additional requirement that the segmenter, like the other components,
must run in real-time. In our system, we use silence regions
and a tri-gram source language model to identify segment
boundaries, but apply additional thresholds to produce segments with an average length of about seven words. As Figure 2 shows, the segment length distributions for both our automatic segmenter and the manually produced segmentation also contain many segments which are considerably longer.

[Figure 2: Number of occurrences of segment length for manual and automatic segmentation.]
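A simplified sketch of such a boundary decision is given below; the weights, thresholds, and fallback length are illustrative assumptions rather than the tuned system values.

# Sketch of the segmentation decision described above: a boundary is placed when
# a silence region and a source LM boundary score together exceed a threshold,
# with a hard maximum length as a fallback.
def is_boundary(pause_sec, lm_boundary_prob, words_since_boundary,
                pause_weight=1.0, lm_weight=0.5, threshold=0.6, max_words=20):
    if words_since_boundary >= max_words:  # hard length fallback
        return True
    score = pause_weight * pause_sec + lm_weight * lm_boundary_prob
    return score > threshold

# e.g. a 0.4 s pause plus a strong LM boundary score triggers a boundary
print(is_boundary(pause_sec=0.4, lm_boundary_prob=0.7, words_since_boundary=9))  # True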
4.6. Stream decoding
The standard pipeline approach poses a dilemma for real-time speech translation systems: On the one hand, allowing
longer segments generally leads to better translation quality.
On the other hand, this translates directly into longer latency
as well, i.e., the delay between words being uttered by the
speaker and the translation of these words being delivered to
the listener.
In addition, choosing meaningful segment boundaries is
difficult and error-prone, and introducing boundaries has obvious drawbacks: each boundary destroys the available context for language modeling and finding longer phrase translations, and no word reordering across segment boundaries
is possible. As we and others have found, rather long segment lengths of 10-40 words are necessary to get reasonably
close to reaching the potential translation performance of the
underlying translation system. A real-time system based on
this approach must therefore compromise translation quality
by using shorter segment lengths to avoid unacceptable latencies, and is still unable to achieve low latency because the
drop in translation quality is quite severe as soon as this is
enforced.
An alternative approach is completely by-passing the
segmentation stage on the input side and directly processing
the continuous input stream from the speech recognizer in
real-time. The idea is to define a maximum delay or latency
that we are willing to accept, and to decouple the decisions
of when to generate translation output from any fixed input
segmentation. Such a stream decoder design, proposed in
[24], still allows for full word reordering within a sliding local window. Under such constraints, it outperforms the usual
input segmentation pipeline and is able to reach its potential at very low latencies, making it especially well-suited for
integrated real-time simultaneous translation systems.
4.6.1. Continuous translation lattice
Our baseline decoder creates one translation lattice from each
input utterance or segment and adds all translation alternatives as additional edges. After the utterance has been decoded, this translation lattice is discarded for the next input
utterance.
For stream decoding, in contrast, we maintain a continuous, rotating translation lattice to be able to process an "infinite" input stream from the speech recognizer in real-time.
New incoming source words from the recognizer are added to
the end of the translation lattice, and the lattice is then immediately expanded with all newly matching word and phrase
translation alternatives. When translation output has been
generated for a part of the current translation lattice, the lattice is truncated at the start to reflect this.

4.6.2. Asynchronous input and output
Each incoming source word triggers a search for the best path
through the current translation lattice, performed similarly as
in the baseline decoder. However, the best translation hypothesis is not immediately output as the translation; rather, the
generation of output can be partially or completely delayed,
either until a time out occurs, or until new input arrives from
the recognizer, leading to lattice expansion and a new search.
This creates a sliding time window during which the
translation output lags the corresponding incoming source
stream, with the current translation lattice representing the
still untranslated input.
Once the decision has been made to output a part of the
best current translation hypothesis, the corresponding part of
the translation lattice is removed from the start and the initial
decoder hypothesis for the next search is set to the state at
the end of the committed output.
4.6.3. Output segmentation
Two parameters are used by the stream decoder to decide
which part of the current best translation hypothesis to output, if any at all:
• Minimum Latency Lmin. The translation covering the
most recent Lmin untranslated source words received
from the speech recognizer at any point is never output,
except when a time out occurs. This effectively means
that the decoder postpones translating the current end
of the input stream until more context becomes available.
• Maximum Latency Lmax. When the latency reaches
Lmax source words, translation output covering the
source words exceeding this value is forced.
To find the output boundary, the currently best translation
hypothesis is traversed backwards until the last Lmin source
words have been passed. If the hypothesis reached has no
reordering gap, the translation up to this point is generated.
Committing to a partial translation means that the lattice up to this point will be deleted; therefore, some source words would remain untranslated if the hypothesis has open reordering gaps.
If the hypothesis does contain open gaps, the traversal
continues backwards until a state is reached where all word
reorderings are closed. If no such state can be found, a new
restricted search through the lattice is performed that only
expands hypotheses which have no open reordering at the
node where the maximum latency was exceeded, i.e. that
have translated all source words up to that point.
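The decision logic of Sections 4.6.2 and 4.6.3 can be sketched as follows; the hypothesis representation and the parameter values are illustrative assumptions.

# Sketch of the stream decoder's output decision described above: latency is the
# number of untranslated source words; output is forced once it exceeds Lmax,
# and the best hypothesis is traversed backwards past the last Lmin source
# words to a state without open reordering gaps before committing.
from collections import namedtuple

State = namedtuple("State", ["covered_src_words", "has_open_gap"])

def choose_commit_point(states, latency, l_min=3, l_max=7):
    """Return the index of the state up to which output is committed, or None."""
    if latency < l_max:
        return None  # stay within the latency budget; wait for more context
    total = states[-1].covered_src_words
    idx = len(states) - 1
    # traverse backwards until the last l_min source words have been passed
    while idx > 0 and total - states[idx].covered_src_words < l_min:
        idx -= 1
    # continue backwards until a state without open reordering gaps is reached
    while idx > 0 and states[idx].has_open_gap:
        idx -= 1
    return idx if idx > 0 else None

states = [State(c, g) for c, g in [(1, False), (2, False), (3, True), (4, False),
                                   (5, False), (6, False), (7, False), (8, False)]]
print(choose_commit_point(states, latency=8))  # -> 4 (commit the first five source words)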
4.6.4. Comparison with input segmentation
We compared the stream decoder with a baseline system
which reached a performance of 29.04 %BLEU on manual transcripts.

[Figure 3: BLEU score vs. maximum latency.]

Using the same models and local reordering window, the stream decoder was configured to respect a maximum latency of N words by setting the Lmax parameter to the
corresponding value. The Lmin value was set to N/2, leading
to an average latency of roughly N/2 words as well. As can
be seen in Figure 3, the stream decoding approach performs
as well as the baseline system at greatly reduced latency, and
in addition is able to guarantee a strict maximum delay that
will never be exceeded.
The quick saturation, however, also indicates that there
is no potential for improvement without a much better model
for long-distance reordering in the German-English language
pair. Once such a model becomes available, the interaction
between reordering and latency will have to be studied in
greater detail.
5. End-to-end evaluation
We evaluated our system on a test set consisting of about
11,500 words from lectures held by two different native German speakers. Since our goal was to build a real-time system
and the overall system speed is limited by the ASR component, we ran the system with different settings to come as
close as possible to a setting which makes the most use of
the available computation time. For these tests, we used input segmentation as described in Section 4.5 rather than the
stream decoding described in Section 4.6.
5.1. Timing studies
Figure 4 plots the real time factor vs. word error rate and
BLEU score. The different RTFs are due to a more or less
restricted search beam in the speech recognition system. We
observe a significant performance drop for RTFs below 0.9.
This threshold is determined by the acoustic environment and
the number of words in the search vocabulary. For optimal
real-time end-to-end performance, we tuned our vocabulary
to reach an RTF of 0.9 on a lapel microphone with reasonable
background noise.

[Figure 4: Real time factor vs. word error rate and BLEU score.]
5.2. Translation quality
Since translation quality scales almost linearly with the word
error rate of the input hypotheses, good ASR performance
is a prerequisite to producing good translations. For our
system and task, a word error rate of about 15% marks the
point where the translation output is becoming useful. As
shown before, our system reaches this threshold under realtime conditions on standard hardware.
Recognition errors in the ASR output lead to a significantly lower translation performance of 23.65 BLEU, compared to 29.04 BLEU on the reference transcription. Nevertheless, the generated English online translation provides a
good idea of what the German lecturer said. One of the most
striking problems we found is the rather awkward word order produced by the online system, which strongly affects the
readability of the English translation. For example, the system produced ”it is a joy here of course this talk to give” from
the German input ”es ist eine Freude natürlich hier diesen
Vortrag zu geben”, while the reference translation was ”it is
a pleasure of course to give this talk here.” For this reason,
our ongoing work focuses, among other things, on extending the rule-based word re-ordering described in Section 4.4.
This approach allows us to apply longer distance reordering
under the given real-time requirements.
6. Conclusion
We have presented our current system for simultaneous translation of German lectures into English, combining state-of-the-art ASR and SMT components. Several challenges specific to this task were identified and experiments performed
to address them. The German language contains many compound words, and German lectures are often interspersed
with English terms and expressions. Our ASR system was
modified to better handle these points. Compound splitting
was also used in the translation system to improve its quality.
In addition, the different models of the translation systems
were adapted to the topic and style of the lectures, and we
Proceedings of IWSLT 2008, Hawaii - U.S.A.
experimented with reducing the latency of the real-time system. The different word order of German and English remains a challenge which will have to be better addressed in the future.
7. Acknowledgments
This work was partly supported by the Quaero Programme,
funded by OSEO, French state agency for innovation. The
authors would like to thank Susanne Burger, Oliver Langewitz, Edda Heeren, Stephan Settekorn and Marleen Stollen
for transcribing and translating the lecture development and
evaluation data, and Christian Fügen for useful discussions.
8. References
[1] C. Fügen, M. Kolss, D. Bernreuther, M. Paulik, S. Stüker, S. Vogel, and A. Waibel, "Open Domain Speech Recognition & Translation: Lectures and Speeches," in Proc. of ICASSP, Toulouse, France, 2006.
[2] T. Kawahara, Y. Nemoto, and Y. Akita, "Automatic Lecture Transcription by Exploiting Presentation Slide Information for Language Model Adaptation," in Proc. of ICASSP, 2008.
[3] "Cepstral," http://www.cepstral.com/.
[4] M. Wölfel and J. McDonough, "Minimum Variance Distortionless Response Spectral Estimation," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 117-126, Sept. 2005.
[5] M. J. F. Gales, "Semi-Tied Covariance Matrices for Hidden Markov Models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.
[6] "Open Source Spell Checking, Stemming, Morphological Analysis and Generation," 2006, http://hunspell.sourceforge.net.
[7] A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit," in Proc. of ICSLP, Denver, Colorado, USA, 2002.
[8] "Scripts for Web Data Collection provided by University of Washington," http://ssli.ee.washington.edu/projects/ears/WebData/web data collection.html.
[9] S. Ochs, M. Wölfel, and S. Stüker, "Verbesserung der automatischen Transkription von englischen Wörtern in deutschen Vorlesungen" [Improving the automatic transcription of English words in German lectures], in Proc. of ESSV, 2008.
[10] M. Wölfel, "A Joint Particle Filter and Multi-Step Linear Prediction Framework to Provide Enhanced Speech Features Prior to Automatic Recognition," in Proc. of HSCMA, 2008.
[11] F. Och, "Minimum Error Rate Training in Statistical Machine Translation," in Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, 2003.
[12] F. J. Och and H. Ney, "A Systematic Comparison of Various Statistical Alignment Models," Computational Linguistics, vol. 29, no. 1, pp. 19-51, 2003.
[13] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer, "The Mathematics of Statistical Machine Translation: Parameter Estimation," Computational Linguistics, vol. 19, no. 2, pp. 263-311, 1993.
[14] S. Vogel, H. Ney, and C. Tillmann, "HMM-based Word Alignment in Statistical Translation," in Proc. of COLING, Copenhagen, 1996, pp. 836-841.
[15] J. Niehues and S. Vogel, "Discriminative Word Alignment via Alignment Matrix Modeling," in Proc. of the Third ACL Workshop on Statistical Machine Translation, Columbus, USA, 2008.
[16] "ACL 2008 Third Workshop on Statistical Machine Translation," http://www.statmt.org/wmt08/.
[17] P. Koehn and K. Knight, "Empirical Methods for Compound Splitting," in Proc. of EACL, Budapest, Hungary, 2003.
[18] P. Koehn and C. Monz, "Manual and Automatic Evaluation of Machine Translation between European Languages," in Proc. of the NAACL 2006 Workshop on Statistical Machine Translation, New York, USA, 2006.
[19] K. Rottmann and S. Vogel, "Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model," in Proc. of TMI, Skövde, Sweden, 2007.
[20] H. Schmid, "Probabilistic Part-of-Speech Tagging Using Decision Trees," in Proc. of the International Conference on New Methods in Language Processing, Manchester, UK, 1994.
[21] E. Matusov, D. Hillard, M. Magimai-Doss, D. Hakkani-Tür, M. Ostendorf, and H. Ney, "Improving Speech Translation by Automatic Boundary Prediction," in Proc. of Interspeech, Antwerp, Belgium, 2007.
[22] C. Fügen and M. Kolss, "The Influence of Utterance Chunking on Machine Translation Performance," in Proc. of Interspeech, Antwerp, Belgium, 2007.
[23] M. Paulik, S. Rao, I. Lane, S. Vogel, and T. Schultz, "Sentence Segmentation and Punctuation Recovery for Spoken Language Translation," in Proc. of ICASSP, Las Vegas, Nevada, USA, April 2008.
[24] M. Kolss, S. Vogel, and A. Waibel, "Stream Decoding for Simultaneous Spoken Language Translation," in Proc. of Interspeech, Brisbane, Australia, 2008.