
The Right Data for MT
Olga Beregovaya, PROMT
Kerstin Bier, Sybase
Melissa Biggs, Oracle
Karen R. Combe, PTC
Jessica Roland, EMC
Problem data for MT training
SMT experience
RBMT experience
Controlled Authoring – Lessons Learned
Controlled Authoring – Signs of Success
Problematic Data
Issue: Excessive number of internal tags
You can use {1}{2}File{3}{4}Instance Operations{5}{6}Update
Index{7}{8} or {9}{10}File{11}{12}Instance Operations{13}
{14}Accelerator Options{15}{16} (which opens the
{17}Instance Accelerator{18} dialog box) to perform
most instance operations.
Pour effectuer la plupart de ces tâches, vous pouvez
utiliser {1}{2}Fichier (File){3}{4}Traitement des
instances (Instance Operations){5}{6}Actualiser
l'index (Update Index){7}{8} ou {9}{10}Fichier
(File){11}{12}Traitement des instances (Instance
Operations){13}{14}Options d'accélérateur
(Accelerator Options){15}{16} afin d'ouvrir la boîte
de dialogue {17}Accélérateur d'instances (Instance
Issue: Irrelevant data
English: 0.31%
French: 0,31 %
English: &*
French: &èce*
English: (Windows
French: (Windows
Issue: homonyms
Bracket #1 (gousset): An overhanging member that
projects from a structure (as a wall) and is usually
designed to support a vertical load or to strengthen
an angle.
bracket #2 (crochet): The bracket character, such as
[ or (.
English: This figure shows that after midsurface
compression, the resulting model develops a gap
between the collet and the bracket.
French: Cette figure montre qu'après la compression
en feuillet moyen, le modèle obtenu crée un jeu entre
le collet et le gousset.
English: All data in brackets [] are optional.
French: Toutes les données entre crochets [] sont
Issue: Acronyms spelled out in the
English: You cannot propagate SDTAEs and DTAEs in a
French: Vous ne pouvez propager ni des éléments
d'annotation d'étiquette de référence ni des éléments
d'annotation de référence de positionnement à
l'intérieur d'une FARP.
Issue: Mismatching number of
English: You can have multiple entries for the same
pipe size in the bend file, that is, a single pipe
size can have multiple bend radius values associated
with it, as shown in the following example of a bend
French: Vous pouvez avoir plusieurs entrées pour la
même taille de tuyau dans le fichier de pliage. En
d'autres termes, une même taille de tuyau peut être
associée à plusieurs valeurs de rayon de pliage,
comme dans le fichier de pliage d'exemple suivant.
Issue: Inconsistent double quote usage
For example, if you create a part with the name
bracket, it initially saves to the file name {1}.
Ainsi, si vous créez une pièce portant le nom
"bracket", elle est tout d'abord enregistrée dans le
fichier {1}.
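A mismatch like this one can be caught automatically before training. The following is a minimal sketch, not a feature of any particular tool: the function names and the set of quote characters are assumptions made for illustration. It flags pairs where only one side uses double quotes, or where the numbered placeholders differ.

```python
import re
from typing import List

# Assumed set of double-quote characters: straight, curly, guillemets, low-9.
QUOTE_CHARS = '"\u201c\u201d\u00ab\u00bb\u201e'
PLACEHOLDER = re.compile(r"\{\d+\}")


def count_quotes(text: str) -> int:
    """Count double-quote characters of any supported style."""
    return sum(text.count(ch) for ch in QUOTE_CHARS)


def check_pair(source: str, target: str) -> List[str]:
    """Return a list of issue labels for one source/target segment pair."""
    issues = []
    # One side quoted, the other not: the case shown on this slide.
    if (count_quotes(source) > 0) != (count_quotes(target) > 0):
        issues.append("quote mismatch")
    # Placeholders such as {1} should appear on both sides.
    if sorted(PLACEHOLDER.findall(source)) != sorted(PLACEHOLDER.findall(target)):
        issues.append("placeholder mismatch")
    return issues
```

Run over a whole TM, a check of this kind gives a list of pairs to review or exclude before engine training.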
Issue: Entity mismatch
English: One way is to create a "flexible
French: Une méthode consiste à créer un modèle
Issue: Punctuation mismatch (brace vs.
English: {1}Copy as Skeleton{2} (the option cannot be
changed) to create a skeleton model.
French: Cliquez sur {1}Copier en tant que squelette
(Copy as Skeleton){2} - option non modifiable - pour
créer un modèle squelette.
Issue: Punctuation mismatch (dash vs.
English: {1}Additional Rotation{2} — Enter a real-number
value for the number of degrees to rotate the
spring's Y axis.
French: {1}Rotation supplémentaire (Additional
Rotation){2} : entrez un nombre réel pour indiquer le
nombre de degrés de rotation de l'axe Y du ressort.
Issue: Capitalization mismatch
English: Piping Master Catalog Directory File
French: Fichier répertoire du catalogue principal de
Issue: English UI strings in the
English: Click View > Color and Appearance to create
or modify colors.
Cliquez sur Affichage (View) > Couleur et apparence
(Color and Appearance) pour créer ou modifier les
The right data for MT
An SMT experience
Kerstin Bier
Getting started with MT:
The Sybase SMT experience(s)
• Engine: Moses
• Add-on: PangeaMT parser for inline markup in output
• Initial language pair: EN -> DE
• Data volume for training: 5 million words
– small data volume, but we do not have more
– our own data, to have better control
• MT (and post-editing) in use for documentation
localization for about 2 months now
Getting the data right:
Automated cleaning and preparation
TMX data
Bilingual XML with inline tags/markup (<ph> etc.),
XML entities like &copy; etc., invalid characters
→ Two plain text files
Two aligned text files, no tags, lowercased and tokenized:
HOUSE → house
By default, → By default ,
→ Moses cleanup
Empty lines, sentence ratio wrong
→ MT engine training
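The cleaning steps on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the actual Sybase pipeline or the Moses scripts: the function names, the naive tokenization rule and the thresholds `max_ratio` and `max_len` are all hypothetical.

```python
import html
import re
from typing import Iterable, List, Tuple

TAG = re.compile(r"<[^>]+>")                           # inline markup such as <ph>
INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")  # control characters
PUNCT = re.compile(r"([.,;:!?()\[\]])")                # naive tokenization rule


def clean_segment(text: str) -> str:
    """Strip tags, decode entities, drop invalid characters, squeeze spaces."""
    text = TAG.sub(" ", text)
    text = html.unescape(text)       # &copy; -> ©, &nbsp; -> U+00A0
    text = INVALID.sub("", text)
    return " ".join(text.split())    # split() also treats U+00A0 as whitespace


def normalize(text: str) -> str:
    """Rough tokenization (spaces around punctuation) followed by lowercasing."""
    return " ".join(PUNCT.sub(r" \1 ", text).split()).lower()


def clean_corpus(pairs: Iterable[Tuple[str, str]],
                 max_ratio: float = 9.0,
                 max_len: int = 100) -> List[Tuple[str, str]]:
    """Keep only pairs that are non-empty, not too long, and balanced in length."""
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize(clean_segment(src)), normalize(clean_segment(tgt))
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue                                   # empty line
        if n_src > max_len or n_tgt > max_len:
            continue                                   # overly long segment
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue                                   # sentence ratio wrong
        kept.append((src, tgt))
    return kept
```

A production tokenizer is considerably more careful (numbers, abbreviations, URLs); the point here is only the order of the steps: strip markup, normalize, then filter.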
Got the right data?
• Pilot project results: high BLEU score, good
productivity test → “good” data?
• Restricted domain
• Consistent style (authoring effort!)
• Consistent terminology (we thought)
• Results of “real world” MT usage confirm the pilot results:
– productivity > 25 % - 300% compared to baseline
– analysis revealed some issues in training data with
effect on output
Three main data issues:
A problem (not only) for MT
Inline Markup
• Inline content translation
(translate vs. notranslate)
• UI references
(Do Not Translates)
• Do Not Translates:
– domain-specific terms
– sample output and more...
Source and target
• complex sentences
• inconsistencies
• ambiguity
MT issue: Inline content
Problems in training data:
• XML tags are removed
– Loss of context information (e.g. DNT)
• Protected (notranslate) inline content
– Removal of tags incl. content = gaps
in training sentences
Inline Content
Output results:
• Incorrectly translated inline content
• Output quality degraded
Possible solutions:
• Amend training data:
– Restore content
– Use placeholders
• Pre-process input (add XML markup)
MT issue: UI References
Problems in training data:
• XML tags removed
– Loss of UI information
• UI strings are “zones” that do not fit into the
sentence structure
UI References
Output results:
• Many incorrect UI string translations
• Weird translations in some places
Possible solutions:
• More training data?
• More promising:
– Handle UI references outside MT
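One way to handle UI references outside MT is to mask them before translation and fill in the localized strings afterwards from the software's own UI glossary. The sketch below assumes that UI strings are marked with a `<uicontrol>` element in the source and that a source-to-target UI glossary is available; the `UIREF` token and both function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

UI_TAG = re.compile(r"<uicontrol>(.*?)</uicontrol>")   # assumed UI markup
UI_TOKEN = re.compile(r"UIREF(\d+)")                   # hypothetical mask token


def mask_ui(source: str) -> Tuple[str, List[str]]:
    """Replace each UI string with a numbered token the engine passes through."""
    ui_strings: List[str] = []

    def repl(match: "re.Match[str]") -> str:
        ui_strings.append(match.group(1))
        return f"UIREF{len(ui_strings)}"

    return UI_TAG.sub(repl, source), ui_strings


def restore_ui(mt_output: str, ui_strings: List[str],
               ui_glossary: Dict[str, str]) -> str:
    """Fill in localized UI strings; fall back to the English string if unknown."""

    def repl(match: "re.Match[str]") -> str:
        original = ui_strings[int(match.group(1)) - 1]
        return ui_glossary.get(original, original)

    return UI_TOKEN.sub(repl, mt_output)
```

With this split, the engine only has to learn where the UI reference goes in the sentence; the wording of the UI string itself comes from the localized software.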
MT issue: Do Not Translates
Problems in training data:
• Loss of DNT information
• Lower-casing (SELECT => select)
• Tokenization (sp_proc => sp _ proc)
• Many untranslated words in corpus
(Do Not Translates)
Output results:
• DNTs translated
• English words in “translate” contexts
Possible solutions:
• Customize lower-casing (=> truecasing)
• Customize tokenizer
• Pre-processing
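As an illustration of the pre-processing option, Do Not Translate terms can be swapped for neutral tokens before lower-casing and tokenization, and swapped back afterwards. This is a sketch under assumptions: the DNT list is supplied by the user, and the token format (`dnt0x`, `dnt1x`, ...) and function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple


def protect_dnt(text: str, dnt_terms: Iterable[str]) -> Tuple[str, Dict[str, str]]:
    """Replace DNT terms with tokens that survive lower-casing and tokenization."""
    mapping: Dict[str, str] = {}
    # Longest terms first, so a long term wins over a shorter one it contains.
    for term in sorted(dnt_terms, key=len, reverse=True):
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
        if pattern.search(text):
            key = f"dnt{len(mapping)}x"   # lowercase letters and digits only
            mapping[key] = term
            text = pattern.sub(key, text)
    return text, mapping


def restore_dnt(text: str, mapping: Dict[str, str]) -> str:
    """Put the original DNT terms back, with their original casing."""
    for key, term in mapping.items():
        text = text.replace(key, term)
    return text
```

For example, `sp_proc` and `SELECT` stay intact even if the rest of the sentence is lower-cased between the two calls.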
MT issue: Source and target issues
Problems in training data:
• Long, complex sentences (source)
• Inconsistent wording/terms (both)
• Ambiguities, omissions (source)
• Translation too “creative”, too “free”
Source and target
Output results:
• Quality degradation, up to useless MT
Possible solutions:
• Source: Pre-editing, authoring control tool
• Target: Translation control (authoring
control for target side)
• TMs (TMX) are usually a good basis for SMT training
• Automated cleaning takes out most of the “dirt”
• MT output improvements can be achieved by:
– Improving the source - authoring control/pre-editing
– Improving the target - translation control
– Extensive terminology work (source and target)
– Pre- and post-processing steps
• For many special output requirements, it makes
more sense to invest time in pre-processing and
post-processing steps than in the training data
RBMT – Pre-processing
terminology and metadata
Olga Beregovaya
Preprocessing of Glossaries
Glossaries are one of the best ways to create a dictionary, but most of the glossaries
provided by customers need to be preprocessed. Preprocessing includes extracting:
•segments with and without translation
•segments with correct and “incorrect” translation (for example, translation with
comments in brackets)
•segments where the source is equal to the target (proper names)
•segments with special characters
•segments in upper case, lower case and mixed case (comparing them and
separating the common and unique strings)
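The extraction steps listed above amount to sorting glossary entries into buckets. The sketch below is not PROMT's tooling; the label names, the regular expressions and the function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

COMMENT = re.compile(r"[\(\[].*?[\)\]]")   # bracketed comment inside a translation
SPECIAL = re.compile(r"[^\w\s\-']")        # anything but letters, digits, space, - and '


def classify(source: str, target: str) -> List[str]:
    """Return the bucket labels that apply to one glossary entry."""
    labels = []
    if not target.strip():
        labels.append("untranslated")
    else:
        labels.append("translated")
        if COMMENT.search(target):
            labels.append("target_has_comment")
        if source.strip() == target.strip():
            labels.append("source_equals_target")   # e.g. proper names
    if SPECIAL.search(source):
        labels.append("special_characters")
    if source.isupper():
        labels.append("upper_case")
    elif source.islower():
        labels.append("lower_case")
    else:
        labels.append("mixed_case")
    return labels


def case_variants(sources: Iterable[str]) -> Dict[str, List[str]]:
    """Group source terms that differ only in casing."""
    groups = defaultdict(set)
    for term in sources:
        groups[term.casefold()].add(term)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

Entries labelled `target_has_comment` are the “incorrect” translations mentioned above; they typically need the bracketed comment stripped before they can become dictionary entries.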
Standard TM
verification/normalization process
During TM verification the following is
addressed through automatic steps
• Irregular characters get flagged and replaced
• Incomplete sentences get flagged
• Punctuation suspects get flagged
• UI strings and other irregular sentences get
added to phrase tables
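The flagging steps can be approximated with a few simple checks. This is a rough sketch rather than PROMT's verification process; the flag names and the heuristics (which characters count as irregular, what ends a complete sentence) are assumptions.

```python
import re
import unicodedata
from typing import List

REPEATED_PUNCT = re.compile(r"[,;:!?]{2,}")
SENTENCE_END = ".!?:"


def verify_segment(text: str) -> List[str]:
    """Return the flags raised for one TM segment."""
    flags = []
    # Control, private-use and unassigned characters, plus U+FFFD, count as
    # irregular here (tabs and newlines inside a segment are flagged too).
    if any(unicodedata.category(ch) in ("Cc", "Co", "Cn") or ch == "\ufffd"
           for ch in text):
        flags.append("irregular_characters")
    # Ignore trailing quotes and closing brackets when looking for the end mark.
    stripped = text.strip().rstrip("\"')\u201d\u00bb")
    if stripped and stripped[-1] not in SENTENCE_END:
        flags.append("incomplete_sentence")
    unbalanced = any(text.count(o) != text.count(c) for o, c in ("()", "[]", "{}"))
    if REPEATED_PUNCT.search(text) or unbalanced:
        flags.append("punctuation_suspect")
    return flags
```

Segments that raise `incomplete_sentence` are often UI strings or headings; as the slide notes, those may be better placed in phrase tables than discarded.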
Handling internal tags – not excessive but useful
Original Source Segment in File
Check <codeph class="+ topic/ph pr-d/codeph">NativeApplication.supportsSystemTrayIcon</codeph> to determine
whether system tray icons are supported on the current system.
Converted to GMS Segment format (after GMS-native segmentation)
Check {1}NativeApplication.supportsSystemTrayIcon{2} to determine whether
system tray icons are supported on the current system.
Pre‐Processed String in XLIFF Segment format is sent to PROMT.
Check <ph i="1" x="&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot;&gt;">{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i="2"
x="&lt;/codeph&gt;">{2}</ph> to determine whether system tray icons are supported on
the current system.
Format of the translated XLIFF Segment returned by PROMT to GMS
Проверить <ph i="1" x="&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot;&gt;">{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i="2"
x="&lt;/codeph&gt;">{2}</ph> для определения системном трее иконки
поддерживает нынешнюю систему.
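The conversion between native inline tags and numbered placeholders shown in these examples can be sketched as a round trip. This is not the GMS or PROMT implementation; the tag pattern and function names are assumptions for illustration.

```python
import re
from typing import Dict, Tuple

# Assumed pattern: any opening or closing XML-style tag.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")
PLACEHOLDER = re.compile(r"\{\d+\}")


def tags_to_placeholders(segment: str) -> Tuple[str, Dict[str, str]]:
    """Replace each inline tag with {1}, {2}, ... and remember the originals."""
    mapping: Dict[str, str] = {}

    def repl(match: "re.Match[str]") -> str:
        key = "{%d}" % (len(mapping) + 1)
        mapping[key] = match.group(0)
        return key

    return TAG.sub(repl, segment), mapping


def placeholders_to_tags(segment: str, mapping: Dict[str, str]) -> str:
    """Restore the original tags; unknown placeholders are left untouched."""
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), segment)
```

Because the mapping is kept per segment, the translated segment can carry the placeholders in a different order and still be restored correctly.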
GMS Integration with XLIFF Connector
– Why is metadata so Important?
Handling irrelevant data
• Scenario 1: We can leave the irrelevant data
untouched and let it propagate from TM or be
handled through special formatting rules
• Scenario 2: We will normalize it and add to the
phrase table
Our system will perform well in either scenario,
and the course of action needs to be the
client's call
Handling homonyms
• PROMT system is specially tailored to handle
one-to-many translations and homonymy
• PROMT approach is to create context-based
dictionary entries, whether single words or
MWEs, which allows the system to properly
identify the correct translation for
ambiguous entries
• PROMT also uses XML metadata when
assigning a semantic class to an entry
Handling expanding acronyms
• PROMT system handles expansion of
acronyms or different acronyms between
languages through creating explicit mapping
• This is a rather standard task in the process of
PROMT engine customization, along with
DoNotTranslate and variable lists
• Should an abbreviation or the expanded
version change, this can be fixed through the
client interface in a matter of seconds
Handling locale-specific punctuation
• Quotation mark usage for a specific small
group of terms can be defined at the dictionary level
• If the use of quotation marks or other
punctuation is universal for a specific locale, it
will be defined at the linguistic rules level
Handling Entity and Capitalization
• The differences in locale setting for Entities
and Capitalization rules are already pre-built
in the baseline engine and are regulated
through regional settings in the product
• All additional differences between locales are
learnt from the TM during the engine
customization phase and then are added to
the client profile template
Suggestion for UI string handling
• All the UI strings will be automatically added
to DoNotTranslate lists when appearing in the
appropriate context
• The context can be detected semantically,
through formatting and punctuation
PROMT handling of internal markup
• This step is not necessary for PROMT
translation process
• Scenario 1: the markup is handled by PROMT
TMX Level 2 extensive TM metadata support
• Scenario 2: if we need to create phrase table
entries from these strings we will normalize,
but the markup will still be preserved in the
translation process
PROMT handling of empty fields
• Scenario 1: “Red flag”: During TM verification
an automatic script will render a warning
message and the empty unit will not be
• Scenario 2: We also can send the empty
segment to the customized engine and obtain
a translation which will be propagated into the
TM for further verification
Olga Beregovaya
• Pre-editing - preprocessing the source language before it is sent to
automated translation. Typical modifications of the source language
include reducing complexity and ambiguity to achieve a more fluent
automated translation.
• Normalization (in this context) – pre-processing of marked-up data to
train MT systems
Examples of incorrect translation caused by poor source
English > Spanish
Are you going to school
Son usted yendo a la escuela
Are you going to school?
¿Va usted a la escuela?
German > English
wie funktioniert das übersetzen
mit dem “clipboard”?
Wie funktioniert das
Übersetzen mit dem “clipboard”?
how does this function translate
with “clipboard”?
How does the translation with “clipboard”
Russian > English
Я часто использую это ПО
I frequently use it ON
Я часто использую это
программное обеспечение
I frequently use this software
PROMT-specific pre-editing tips
For best translation quality, avoid the following constructions in the source:
Adjacent identical clauses (with standard and non-standard passive, e.g. “he was
asked and helped”); similar participles are not always analyzed as such, so it is
good practice to repeat the auxiliary verb
“When asked” and similar clauses; a full sentence is always better
Ellipses and all other types of incomplete sentences, including sentences like “I
have a suspicion he can be late today”; it is good practice to always add “that”
Missing articles and determiners when homonymy needs to be parsed
Postposed participles, such as “the problems discussed”
Incorrect punctuation, including incorrectly used hyphens (a hyphen used instead of
an em dash); an expression with a hyphen will be parsed as a single word
Other possible sources of PROMT errors:
The following errors need to be corrected in the customer profile, then files need to
be re-translated:
• Morphological errors: Incorrect morphology in the target may be caused by
incorrect morphological attributes in your dictionary, check the attributes using
PROMT Dictionary Editor
• Proper names, brand names and the like are translated: add them to the
DoNotTranslate list
• Incorrect syntax in the target may be caused by incorrect markup parsing rules:
check your filters and rules settings
Controlled Authoring
Melissa Biggs
“Technical” Challenges for Authoring
Tools Adoption
Diverse authoring tools and styles
Multiple and wide range of authors/groups in an
Lack of process, measurement methodology and
corporate accountability in authoring communities
Standalone use (lack of architecture to produce
automated process for full lifecycle -> editing,
publishing, translation)
“Cultural” Pitfalls for Authoring Tools
Multiple and wide range of authors/groups in an
Resistance by authors to a “control” tool
Lack of interest by content creators, as the full
benefits may not be visible to the creator
Challenge in defining a clear ROI
Standalone tool use (lack of architecture to
produce automated process for full lifecycle ->
editing, publishing, translation)
Case Study (pre-MT)
Globalization group purchases SW license
for authoring tool
G11n group drives adoption in pubs groups;
provides training, support, assistance with rules
Implemented and mandated for use by 1
publications group using SGML authoring
Demos, but no traction or acceptance, in 4
additional publications groups + marketing
No metrics or tracking implemented by pubs
Decreased acceptance of use of tool over time
Case Study: The Tool
Supports application of a common style through
a rule set which results in
Clean and structured source
Consistent terminology across the document
(less confusion and higher user satisfaction)
Optimizing the maintenance of information
Improved search and retrieval of information
Applying rules via tool helped to create a clean
and structured translation source – important for
implementing machine translation
Case Study: The Tool
English documentation processed using tool =
easier and faster translation (for x target
languages 1 ambiguity in the source generates
x queries during translation cycle)
Reduced translation cycle, faster time-to-market
Fewer ambiguities in the source => more
accurate and consistent translations => higher
customer satisfaction
Case Study - Results
Increased content reuse for both English and
translated content
Limits in ability to scale (increase) content and
increase quality
Editor time not reduced, but less focus on minor,
repetitive errors
Decreasing acceptance of use of tool over time
– Value proposition not compelling in Pubs
– Cross - product savings/benefits not visible
– Publications measurements/metrics not
tracked consistently
Globalization team viewed as an enforcer
Learnings: It's the Culture, not the Tool
Define the total value proposition + process
chain for the tool
Include Terminology, localization/translation
Define and administer a Content LifeCycle
•Include Critical phase for “pre-editing”
Find right central ownership for authoring tool
Not a standalone technology/process
Simultaneous adoption may scale more
effectively than group-by-group adoption
Engage with globalization early
Accountability to management -- products
Continuous Scorecard reporting
Controlled Authoring
Jessica Roland
EMC Information Intelligence Group
Controlled Source - Pro
•Acquired Controlled Authoring tool in 2008
• Compared two market leaders
•Influenced by IT peer company references
• 86% of writers have access
•Current focus: spelling, grammar, style
Controlled Source - Pro
•Positive feedback from writers
“I did run the tool, and to my shock and
amazement, it found lots of stuff”
“And I thought I didn't use passive voice much!”
“I'm finding it very helpful! has flagged
passive constructions that I was too lazy or time-crunched to fix before, as well as a number of
other "gotchas" that simply take a little more
time to reconsider.”
Controlled Source - Pro
•Before and after reports - results and scores
•Measurable improvement in grammar and style
•Need intelligent reuse module for word count
• Lesson learned: Get the IR module right away
Controlled Source - Pro
•Careful with changes to legacy content during
•Process with writers:
•Check legacy content after last drop or post-release
• Check discrete new feature content and improve
iteratively – it’s relatively small
•Run before/after metrics on whole book, after the
last drop to L10N
Controlled Source - Pro
•MT is only used with documentation
•Observing MT savings increase since tool
deployment, even without IR
•MT likes cleaner text
•Greater savings by hour than by word
Controlled Source – Pro
•Acquired Controlled Authoring tool in 2008
•Positive feedback from writers
•Need intelligent reuse module for word count
•Careful with changes to legacy TM
•Observing MT savings increase since tool
Metrics and recommendations
What is good data for MT?
General content pre-editing tips
Good source = Good machine translation
Flawed source = $%#@!
• Check your spelling, including upper/lower case
• Check for proper punctuation
• Use diacritics correctly
• Use simple syntactic constructions
• Do not omit syntactic words
• Use conventional abbreviations
• Avoid slang
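Some of these tips can be checked mechanically before text is sent to MT. The sketch below covers only three of them (final punctuation, sentence-initial capitalization, unknown abbreviations); the function name, the warning labels and the idea of passing in a list of known abbreviations are assumptions, not part of any specific pre-editing tool.

```python
import re
from typing import Iterable, List

WORD = re.compile(r"\b[^\W\d_]{2,}\b")   # runs of two or more letters, any script


def lint_source(sentence: str, known_abbreviations: Iterable[str] = ()) -> List[str]:
    """Return pre-editing warnings for one source sentence."""
    known = set(known_abbreviations)
    warnings = []
    s = sentence.strip()
    if s and s[-1] not in ".!?":
        warnings.append("missing_terminal_punctuation")
    if s and s[0].isalpha() and s[0].islower():
        warnings.append("lowercase_start")
    for token in WORD.findall(s):
        # All-caps tokens are treated as abbreviations the engine may not know.
        if token.isupper() and token not in known:
            warnings.append(f"unknown_abbreviation:{token}")
    return warnings
```

Applied to the poor-source examples earlier in the deck, the missing question mark, the lowercase sentence start and the abbreviation ПО would each raise a warning.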
Data normalization tips
• Identify the problematic issues in your data and ways of
addressing them, either through pre-editing tools or your
MT engine's pre-processing capabilities
• Decide what needs to be addressed through automated
processing and what can be left to post-editors to correct
• Sometimes your preferred formatting and markup can be
in conflict with MT engine’s logic – quotes, brackets,
capitalization are not MT’s best friends. Be prepared to
choose your battles
Result – successful MT deployment
• Automated metric scores, i.e. BLEU/METEOR scores,
will double with an engine trained on clean data and/or
good terminology
• Post-editors are able to concentrate on polishing the
language rather than dealing with omissions,
incorrect terminology, mystery tags
• Time to market can be reduced by 25 to 40 percent
• Translation costs can be reduced by approximately
25 to 40 percent