The Right Data for MT

Olga Beregovaya, PROMT
Kerstin Bier, Sybase
Melissa Biggs, Oracle
Karen R. Combe, PTC
Jessica Roland, EMC

Agenda
• Introduction
• Problem data for MT training
• SMT experience
• RBMT experience
• Pre-Editing
• Controlled Authoring – Lessons Learned
• Controlled Authoring – Signs of Success
• Recommendations

Problematic Data
Karen R. Combe, PTC

Issue: Excessive number of internal tags
English: You can use {1}{2}File{3}{4}Instance Operations{5}{6}Update Index{7}{8}{9}{10}File{11}{12}Instance Operations{13} {14}Accelerator Options{15}{16} (which opens the {17}Instance Accelerator{18} dialog box) to perform most instance operations.
French: Pour effectuer la plupart de ces tâches, vous pouvez utiliser {1}{2}Fichier (File){3}{4}Traitement des instances (Instance Operations){5}{6}Actualiser l'index (Update Index){7}{8} ou {9}{10}Fichier (File){11}{12}Traitement des instances (Instance Operations){13}{14}Options d'accélérateur (Accelerator Options){15}{16} afin d'ouvrir la boîte de dialogue {17}Accélérateur d'instances (Instance Accelerator){18}

Issue: Irrelevant data
English: 0.31%
French: 0,31 %
English: &amp;amp;asm.mbr.name==part*
French: &amp;asm.mbr.name==pièce*
English: (Windows NT/95/98/2000)D:\partlib\{1}\objects
French: (Windows NT/95/98/2000)D:\partlib\{1}\objects

Issue: Homonyms
• Bracket #1 (gousset): An overhanging member that projects from a structure (as a wall) and is usually designed to support a vertical load or to strengthen an angle.
• Bracket #2 (crochet): The bracket character, such as [ or (.
English: This figure shows that after midsurface compression, the resulting model develops a gap between the collet and the bracket.
French: Cette figure montre qu'après la compression en feuillet moyen, le modèle obtenu crée un jeu entre le collet et le gousset.
English: All data in brackets [] are optional.
French: Toutes les données entre crochets [] sont facultatives.

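Two of the issues above, excessive internal tags and irrelevant data, lend themselves to automated flagging before training. The following is only a minimal sketch, not a tool used by any of the presenters: the function names, the 0.5 tags-per-word threshold, and the "no letters in the source" rule are all assumptions chosen for illustration.

```python
import re

# Numbered placeholders such as {1} or {17}, as in the examples above.
PLACEHOLDER = re.compile(r"\{\d+\}")


def tag_density(segment: str) -> float:
    """Return placeholders per word in a segment (0.0 if there are no words)."""
    tags = PLACEHOLDER.findall(segment)
    words = PLACEHOLDER.sub(" ", segment).split()
    return len(tags) / len(words) if words else 0.0


def is_irrelevant(source: str, target: str) -> bool:
    """Flag pairs with little to learn from: identical text, or a source without letters."""
    if source.strip() == target.strip():
        return True
    stripped = PLACEHOLDER.sub("", source)
    return not any(ch.isalpha() for ch in stripped)


def flag_pair(source: str, target: str, max_density: float = 0.5) -> list:
    """Return a list of flags for one translation unit (threshold is an assumption)."""
    flags = []
    if tag_density(source) > max_density:
        flags.append("excessive_tags")
    if is_irrelevant(source, target):
        flags.append("irrelevant")
    return flags
```

A check like this only flags candidates for review. It would not catch pairs such as the `asm.mbr.name` example, where the source contains letters and differs from the target.
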
Issue: Acronyms spelled out in the target
English: You cannot propagate SDTAEs and DTAEs in a DTAF.
French: Vous ne pouvez propager ni des éléments d'annotation d'étiquette de référence ni des éléments d'annotation de référence de positionnement à l'intérieur d'une FARP.

Issue: Mismatching number of sentences
English: You can have multiple entries for the same pipe size in the bend file, that is, a single pipe size can have multiple bend radius values associated with it, as shown in the following example of a bend file.
French: Vous pouvez avoir plusieurs entrées pour la même taille de tuyau dans le fichier de pliage. En d'autres termes, une même taille de tuyau peut être associée à plusieurs valeurs de rayon de pliage, comme dans le fichier de pliage d'exemple suivant.

Issue: Inconsistent double quote usage
English: For example, if you create a part with the name bracket, it initially saves to the file name {1}.
French: Ainsi, si vous créez une pièce portant le nom "bracket", elle est tout d'abord enregistrée dans le fichier {1}.

Issue: Entity mismatch
English: One way is to create a &amp;quot;flexible model.
French: Une méthode consiste à créer un modèle souple.

Issue: Punctuation mismatch (parentheses vs. dash)
English: {1}Copy as Skeleton{2} (the option cannot be changed) to create a skeleton model.
French: Cliquez sur {1}Copier en tant que squelette (Copy as Skeleton){2} - option non modifiable - pour créer un modèle squelette.

Issue: Punctuation mismatch (dash vs. colon)
English: {1}Additional Rotation{2} — Enter a realnumber value for the number of degrees to rotate the spring's Y axis.
French: {1}Rotation supplémentaire (Additional Rotation){2} : entrez un nombre réel pour indiquer le nombre de degrés de rotation de l'axe Y du ressort.

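Several of the mismatches above (sentence count, double quotes, entities) can be detected by comparing simple counts on the source and target side. This is a rough sketch under stated assumptions: the sentence splitter is a crude regular expression that will miscount abbreviations, the set of quote characters is an illustrative choice, and none of the names come from the presenters' tooling.

```python
import re

# Sentence-final punctuation followed by whitespace or end of string (crude heuristic).
SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")
# Named or numeric character entities such as &quot; or &#160;.
ENTITY = re.compile(r"&(?:[A-Za-z]+|#\d+);")
# Quote characters considered here (an assumption; extend per locale).
QUOTES = "\"“”„«»"


def count_sentences(text: str) -> int:
    return len(SENTENCE_END.findall(text))


def count_quotes(text: str) -> int:
    return sum(text.count(q) for q in QUOTES)


def mismatches(source: str, target: str) -> list:
    """Return the names of the checks on which source and target disagree."""
    issues = []
    if count_sentences(source) != count_sentences(target):
        issues.append("sentence_count")
    # Compare presence rather than number, since quote styles differ by locale.
    if (count_quotes(source) == 0) != (count_quotes(target) == 0):
        issues.append("quotes")
    if len(ENTITY.findall(source)) != len(ENTITY.findall(target)):
        issues.append("entities")
    return issues
```

Flagged units can then be reviewed or excluded from training, depending on how much data can be spared.
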
Issue: Capitalization mismatch
English: Piping Master Catalog Directory File
French: Fichier répertoire du catalogue principal de tuyauterie

Issue: English UI strings in the translation
English: Click View > Color and Appearance to create or modify colors.
French: Cliquez sur Affichage (View) > Couleur et apparence (Color and Appearance) pour créer ou modifier les couleurs.

The right data for MT: An SMT experience
Kerstin Bier, Sybase

Getting started with MT: The Sybase SMT experience(s)
• Engine: Moses
• Add-on: PangeaMT parser for inline markup in output
• Initial language pair: EN -> DE
• Data volume for training: 5 million words
  – small data volume, but we do not have more
  – our own data, to have better control
• MT (and post-editing) in use for documentation localization for ca. 2 months now

Getting the data right: Automated cleaning and preparation
• Input: TMX data (bilingual XML with inline tags/markup)
• Cleanup: Entities – conversion of XML entities like &copy; &nbsp; etc.
• Cleanup: Characters – invalid characters
• Cleanup: Tags – remove <ph> etc.
• Result: two plain text files
• Lowercasing – example: HOUSE → house
• Tokenization (Moses) – example: By default, → By default ,
• Cleanup: Segments – empty lines, wrong sentence ratio
• Result: two aligned text files, no tags, lower-cased
• MT engine training

Got the right data?
• Pilot project results: high BLEU score, good productivity test → “good” data?
  – Restricted domain
  – Consistent style (authoring effort!)
  – Consistent terminology (we thought)
• Results of “real world” MT usage confirm the pilot results: productivity up by 25% to 300% compared to baseline
• BUT: analysis revealed some issues in the training data with an effect on output

Three main data issues: A problem (not only) for MT
• Inline markup content. Issues: inline content translation (translate vs. notranslate), UI references
• DNTs (Do Not Translates). Issues: domain-specific terms, sample output, and more...
• Source and target issues. Issues: complex sentences, inconsistencies, ambiguity

MT issue: Inline content
Problems in training data:
• XML tags are removed
  – Loss of context information (e.g. DNT)
• Protected (notranslate) inline content
  – Removal of tags incl. content = gaps in training sentences
Output results:
• Incorrectly translated inline content
• Output quality degraded
Possible solutions:
• Amend training data: restore content, use placeholders
• Pre-process input (add XML markup)

MT issue: UI References
Problems in training data:
• XML tags removed
  – Loss of UI information
• UI strings are “zones” and do not fit in sentence structure
Output results:
• Many incorrect UI string translations
• Weird translations in some places
Possible solutions:
• More training data?
• More promising: handle UI references outside MT

MT issue: Do Not Translates
Problems in training data:
• Loss of DNT information
  – Lower-casing (SELECT => select)
  – Tokenization (sp_proc => sp _ proc)
• Many untranslated words in corpus
Output results:
• DNTs translated
• English words in “translate” contexts
Possible solutions:
• Customize lower-casing (=> truecasing)
• Customize tokenizer
• Pre-processing

MT issue: Source and target issues
Problems in training data:
• Long, complex sentences (source)
• Inconsistent wording/terms (both)
• Ambiguities, omissions (source)
• Translation too “creative”, too “free”
Output results:
• Quality degradation, up to useless MT output
Possible solutions:
• Source: pre-editing, authoring control tool
• Target: translation control (authoring control for target side)

Summary
• TMs (TMX) are usually a good basis for SMT training
• Automated cleaning takes out most of the “dirt”
• MT output improvements can be achieved by:
  – Improving the source: authoring control/pre-editing
  – Improving the target: translation control
  – Extensive terminology work (source and target)
  – Pre- and post-processing steps
• For many special output requirements, it makes more sense to invest time in pre-processing and post-processing steps than in the training data

RBMT – Pre-processing terminology and metadata
Olga Beregovaya, PROMT

Preprocessing of Glossaries
Glossaries are one of the best ways to create a dictionary, but most of the glossaries provided by customers need to be preprocessed. Preprocessing includes extracting:
• segments with and without translation
• segments with correct and “incorrect” translation (for example, translation with comments in brackets)
• segments where the source is equal to the target (proper names)
• segments with special characters
• segments in upper case, lower case and mixed case (comparing them and separating the common and unique strings)

Standard TM verification/normalization process
During TM verification the following is addressed through automatic steps:
• Irregular characters get flagged and replaced
• Incomplete sentences get flagged
• Punctuation suspects get flagged
• UI strings and other irregular sentences get added to phrase tables

Handling internal tags – not excessive but useful
Original Source Segment in File:
Check <codeph class="+ topic/ph pr-d/codeph">NativeApplication.supportsSystemTrayIcon</codeph> to determine whether system tray icons are supported on the current system.
Converted to GMS Segment format (after GMS-native segmentation):
Check {1}NativeApplication.supportsSystemTrayIcon{2} to determine whether system tray icons are supported on the current system.
Pre-Processed String in XLIFF Segment format, as sent to PROMT:
Check <ph i=1 x=”&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot; &gt;”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 &lt/codeph&gt;>{2}</ph> to determine whether system tray icons are supported on the current system.
Format of the translated XLIFF Segment returned by PROMT to GMS:
Проверить <ph i=1 x=”&lt;codeph class=&quot;+ topic/ph pr-d/codeph&quot; &gt;”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 &lt/codeph&gt;>{2}</ph> для определения системном трее иконки поддерживает нынешнюю систему.

GMS Integration with XLIFF Connector – Why is metadata so important?

Handling irrelevant data
• Scenario 1: We can leave the irrelevant data untouched and let it propagate from the TM or be handled through special formatting rules
• Scenario 2: We will normalize it and add it to the phrase table
Our system will perform well in either scenario, and the course of action needs to be the client's call.

Handling homonyms
• The PROMT system is specially tailored to handle one-to-many translations and homonymy
• The PROMT approach is to create context-based dictionary entries, whether single words or MWEs (multi-word expressions), which allows the system to properly identify the correct translation for ambiguous entries
• PROMT also uses XML metadata when assigning a semantic class to an entry

Handling expanding acronyms
• The PROMT system handles expansion of acronyms, or different acronyms between languages, by creating an explicit mapping
• This is a rather standard task in the process of PROMT engine customization, along with DoNotTranslate and variable lists
• Should an abbreviation or the expanded version change, this can be fixed through the client interface in a matter of seconds

Handling locale-specific punctuation
• Quotation mark usage for a specific small group of terms can be defined on the dictionary level
• If the use of quotation marks or other punctuation is universal for a specific locale, it will be defined on the linguistic rules level

Handling Entity and Capitalization mismatch
• The differences in locale settings for entities and capitalization rules are already pre-built in the baseline engine and are regulated through regional settings in the product interface
• All additional differences between locales are learnt from the TM during the engine customization phase and are then added to the client profile template

Suggestion for UI string handling
• All UI strings will be automatically added to DoNotTranslate lists when appearing in the appropriate context
• The context can be detected semantically, through formatting and punctuation

PROMT handling of internal markup
• This step is not necessary for the PROMT translation process
• Scenario 1: the markup is handled by PROMT's extensive TMX Level 2 TM metadata support
• Scenario 2: if we need to create phrase table entries from these strings we will normalize, but the markup will still be preserved in the translation process

PROMT handling of empty fields
• Scenario 1: “Red flag”: during TM verification an automatic script will raise a warning message and the empty unit will not be propagated
• Scenario 2: We can also send the empty segment to the customized engine and obtain a translation, which will be propagated into the TM for further verification

Pre-Editing
Olga Beregovaya, PROMT

Pre-editing
Definitions:
• Pre-editing – preprocessing the source language before it is sent to automated translation. Typical modifications of the source language include reducing complexity and ambiguity to achieve a more fluent automated translation.
• Normalization (in this context) – pre-processing of marked-up data to train MT systems

Examples of incorrect translation caused by poor source
English > Spanish
Incorrect: Are you going to school → Son usted yendo a la escuela
Correct: Are you going to school? → ¿Va usted a la escuela?
German > English
Incorrect: wie funktioniert das übersetzen mit dem “clipboard”? → how does this function translate with “clipboard”?
Correct: Wie funktioniert das Übersetzen mit dem “clipboard”? → How does the translation with “clipboard” function?
Russian > English
Incorrect: Я часто использую это ПО → I frequently use it ON
Correct: Я часто использую это программное обеспечение → I frequently use this software

PROMT-specific pre-editing tips
For best translation quality, the following constructions are to be avoided in the source:
• Adjacent identical clauses (with standard and non-standard passive, i.e. “he was asked and helped”); similar participles are not always analyzed as such, so it is good practice to repeat the auxiliary verb
• “When asked” and similar clauses; a full sentence is always better
• Ellipses and all other types of incomplete sentences, including sentences like “I have a suspicion he can be late today”; it is good practice to always add “that”
• Missing articles and determiners when homonymy needs to be parsed
• Postposed participles, such as “the problems discussed”
• Incorrect punctuation, including incorrectly used hyphens (hyphen used instead of an em-dash); an expression with a hyphen will be parsed as a single word

Other possible sources of PROMT errors
The following errors need to be corrected in the customer profile, then files need to be re-translated:
• Morphological errors: incorrect morphology in the target may be caused by incorrect morphological attributes in your dictionary; check the attributes using PROMT Dictionary Editor
• Proper names, brand names and the like are translated: add them to the DoNotTranslate list
• Incorrect syntax in the target may be caused by incorrect markup parsing rules: check your filters and rules settings

Controlled Authoring
Melissa Biggs, Oracle

“Technical” Challenges for Authoring Tools Adoption
• Diverse authoring tools and styles
• Multiple and wide range of authors/groups in an Enterprise
• Lack of process, measurement methodology and corporate accountability in authoring communities
• Tracking metrics/measures
• Standalone use (lack of architecture to produce automated process for full lifecycle -> editing, publishing, translation)

“Cultural” Pitfalls for Authoring Tools Adoption
• Multiple and wide range of authors/groups in an Enterprise
• Resistance by authors to a “control” tool
• Lack of interest by content creators, as the full benefits may not be visible to the creator
• Challenge in defining a clear ROI
• Standalone tool use (lack of architecture to produce automated process for full lifecycle -> editing, publishing, translation)

Case Study (pre-MT)
• Globalization group purchases SW license for authoring tool
• G11n group drives adoption in pubs groups; provides training, support, assistance with rules
• Implemented and mandated for use by 1 publications group using SGML authoring
• Demos, but no traction or acceptance, in 4 additional publications groups + marketing
• No metrics or tracking implemented by pubs group
• Decreased acceptance of use of tool over time

Case Study: The Tool
Supports application of a common style through a rule set, which results in:
• Clean and structured source
• Consistent terminology across the document (less confusion and higher user satisfaction)
• Optimized maintenance of information
• Improved search and retrieval of information
Applying rules via the tool helped to create a clean and structured translation source – important for implementing machine translation

Case Study: The Tool
• English documentation processed using tool = easier and faster translation (for x target languages, 1 ambiguity in the source generates x queries during the translation cycle)
• Reduced translation cycle, faster time-to-market
• Fewer ambiguities in the source => more accurate and consistent translations => higher customer satisfaction

Case Study - Results
• Increased content reuse for both English and translated content
• Limits in ability to scale (increase) content and increase quality
• Editor time not reduced, but less focus on minor, repetitive errors
• Decreasing acceptance of use of tool over time
  – Value proposition not compelling in Pubs
  – Cross-product savings/benefits not visible
  – Publications measurements/metrics not tracked consistently
• Globalization team viewed as an enforcer

Learnings: It's the Culture, not the Tool
• Define the total value proposition + process chain for the tool
  – Include terminology, localization/translation
• Define and administer a Content LifeCycle methodology
  – Include critical phase for “pre-editing”
• Find right central ownership for authoring tool
  – Not a standalone technology/process
• Simultaneous adoption may scale more effectively than group-by-group adoption
• Engage with globalization early
• TRACK & MEASURE
  – Accountability to management -- products
  – Continuous Scorecard reporting

Controlled Authoring
Jessica Roland, EMC Information Intelligence Group

Controlled Source - Pro
• Acquired Controlled Authoring tool in 2008
• Compared two market leaders
• Influenced by IT peer company references
• 86% of writers have access
• Current focus: spelling, grammar, style

Controlled Source - Pro
• Positive feedback from writers
  – “I did run the tool, and to my shock and amazement, it found lots of stuff”
  – “And I thought I didn't use passive voice much!”
  – “I'm finding it very helpful!...it has flagged passive constructions that I was too lazy or timecrunched to fix before, as well as a number of other "gotchas" that simply take a little more time to reconsider.”

Controlled Source - Pro
• Before and after reports - results and scores
• Measurable improvement in grammar and style
• Need intelligent reuse module for word count reduction
• Lesson learned: Get the IR module right away

Controlled Source - Pro
• Careful with changes to legacy content during L10N…$$$
• Process with writers:
  – Check legacy content after last drop or post-release
  – Check discrete new feature content and improve iteratively – it's relatively small
  – Run before/after metrics on whole book, after the last drop to L10N

Controlled Source - Pro
• MT is only used with documentation
• Observing MT savings increase since tool deployment, even without IR
• MT likes cleaner text
• Greater savings by hour than by word

Controlled Source – Pro Summary
• Acquired Controlled Authoring tool in 2008
• Positive feedback from writers
• Need intelligent reuse module for word count reduction
• Careful with changes to legacy TM
• Observing MT savings increase since tool deployment

Metrics and recommendations
What is good data for MT?

General content pre-editing tips
Good source = Good machine translation
Flawed source = $%#@!
Recommendations:
• Check your spelling, including upper/lower case
• Check for proper punctuation
• Use diacritics correctly
• Use simple syntactic constructions
• Do not omit syntactic words
• Use conventional abbreviations
• Avoid slang

Data normalization tips
• Identify the problematic issues in your data and ways of addressing them, either through pre-editing tools or your MT engine's pre-processing capabilities
• Decide what needs to be addressed through automated processing and what can be left to post-editors to correct
• Sometimes your preferred formatting and markup can be in conflict with the MT engine's logic – quotes, brackets, capitalization are not MT's best friends. Be prepared to choose your battles

Result – successful MT deployment
• Automated metric scores, i.e. BLEU/METEOR scores, will double with an engine trained on clean data and/or good terminology
• Post-editors are able to concentrate on polishing the language rather than dealing with omissions, incorrect terminology, mystery tags
• Time to market can be reduced by 25 to 40 percent
• Translation costs can be reduced by approximately 25 to 40 percent
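
The automated side of the data normalization tips can be sketched as a small preparation step, loosely in the spirit of the cleanup described in the SMT section. Everything here is an assumption made for illustration rather than any presenter's pipeline: the `__TAG__` placeholder, the whitespace tokenization, the DNT set, and the length-ratio limit of 3.0 are hypothetical choices. Tags are replaced with a placeholder instead of being deleted, and do-not-translate tokens are exempt from lowercasing.

```python
import html
import re

# Any inline markup element such as <ph> or </ph>.
INLINE_TAG = re.compile(r"<[^>]+>")
TAG_PLACEHOLDER = "__TAG__"  # hypothetical placeholder token


def prepare(segment: str, dnt_terms: set) -> str:
    """Replace tags with a placeholder, unescape entities, lowercase non-DNT tokens."""
    # Tags go first so that an unescaped "&lt;" is never mistaken for markup.
    text = INLINE_TAG.sub(f" {TAG_PLACEHOLDER} ", segment)
    text = html.unescape(text)
    tokens = []
    for tok in text.split():  # whitespace split keeps names like sp_proc intact
        if tok in dnt_terms or tok == TAG_PLACEHOLDER:
            tokens.append(tok)
        else:
            tokens.append(tok.lower())
    return " ".join(tokens)


def keep_pair(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Drop empty segments and pairs whose word counts differ too much."""
    s, t = source.split(), target.split()
    if not s or not t:
        return False
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio
```

For example, `prepare("Run <ph>sp_proc</ph> with SELECT &amp; Commit", {"SELECT", "sp_proc"})` keeps `SELECT` and `sp_proc` unchanged while lowercasing the rest. A real text containing the literal token `__TAG__` would collide with the placeholder, so a production setup would need a safer marker.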