Language resources and their commercial applications
Kara Warburton
kara@ca.ibm.com

My aim
- Demonstrate the value of language resources for commercial applications
- Discuss why standards for language resources are important
- Present TC37 as a standards-developing organization
- Warning – slight terminology bias!

ISO/TC 37 Terminology and other language and content resources

Managing language resources
- A language resource is:
  - Information expressed in a natural language
  - Information that supports the interpretation of natural language
- Language resources can enhance business processes:
  - If properly deployed
  - Requires interoperability, which in turn requires standards

Why me?
- Implemented terminological resources, lexical resources, and standards for content interoperability in business environments: terminologist for IBM, LISA contributor, business consultant
- Developed standards and best practices for language resources: ISO TC37, LISA
- Practical experience as a technical writer and translator, using language resources in increasingly technical environments

The cold reality
- The computer age has generated exponential growth of information and knowledge.
- Even with the aid of computers, we can’t manage this volume of information. Why?
- Computers can’t understand “natural” language. They only understand “1” and “0”.
- Natural language is largely unstructured; even many structured language resources are “unpredictably” structured.
- This environment demands increasing volumes of structured language resources to enable next-generation computing.

Business scenarios for managing language resources
- Translation memories
- Terminologies and lexical resources for enhancing NLP applications
- Content management and retrieval
- Content repurposing
- Content classification
- Normalized language
- Keyword management
- Example: term extraction tool, using “layered” lexical resources, grammatical rules, and ranking algorithms

Managing terminology supports both social and commercial interests
- Economic/commercial:
  - Control terminology to ensure quality and minimize production costs.
  - Build terminological resources that are repurposable across the content management chain.
  - Increase competitiveness in local and global markets.
- Social/geopolitical:
  - Strengthen and protect minority languages.
  - Support cultural diversity.
  - Increase global presence and visibility.

Is “managed” terminology really important for a business?
- In the automotive industry, almost 50% of translation errors are “wrong term” errors (Woyde)
- 40% of the time required for text production is terminology work (Stellbrink)
- Between 30% and 70% of errors in technical documentation are terminology errors (Schutz, and MULTIDOC)
- Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)
- Return on investment: 10% ($100 investment yields $110 return) (Champagne)
- Outsourced translations may be 50% more expensive if source terminology is inconsistent (Kjeldgaard)

Need more proof?
- Terminology tools increase productivity by approx. 20% (Champagne)
- Without a central reference, each needless search can take 20 to 30 minutes (Champagne)
- It costs 10 times more to fix a term at the end of the production cycle than at the beginning (Xerox, JDEdwards)
- Inconsistent or inaccurate terminology raises service costs
- Terminology mistakes can lead to lawsuits for copyright or trademark infringement, or for damages due to defective products or incorrect user documentation.

IBM scenario…
- “Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)”
- 429 million words are translated per year in IBM. At about 5%, that is over 21 million words requiring attention.
- In 2009, IBM “processed” over 160,000 terms as part of the “content conveyor belt”, in nearly 3,000 specialized “dictionaries”.
- Very small staff
- High degree of automation

What “measures” need to be taken?
- Deploy a terminology database that serves multiple purposes
- Integrate the database into all content environments to ensure a “push” mechanism
- Respect data management principles, such as data granularity, elementarity, etc.
- Adopt best practices for terminology, such as term autonomy and concept orientation
- Allow for extensibility for features such as morphology, as needed for future applications

Basic example – repurpose information

    <h1>CI revision conflicts</h1>
    <p>When revising your <term keyref="ci">CI</term>, to avoid conflicts...</p>

    <glossentry id="ci">
      <glossterm>configuration item</glossterm>
      <glossdef>An entity in a configuration that satisfies an end-use function and can be uniquely identified.</glossdef>
      <glossBody>
        <glossAlt>
          <glossAcronym>CI</glossAcronym>
        </glossAlt>
      </glossBody>
    </glossentry>

Controlled authoring

Controlled translation

Search – Query expansion

Source data…

Search – Query correction

Synonyms/inconsistencies multiply in the target language – this is bad for business
- English source terms:
  - automatic memory reclamation
  - automatic storage reclamation
  - garbage collection
- French translations:
  - remise en état automatique du mémoire
  - remise en état automatique de l’archivage
  - remise en état automatique du stockage
  - récupération automatique de mémoire
  - récupération automatique de l’archivage
  - récupération automatique du stockage
  - récupération de place
  - vidage de la corbeille
  - récupération de place en mémoire
  - récupération de positions inutilisées
  - récupération de l’espace mémoire

“Cosmetic” differences can become more than cosmetic in the target language
- English:
  1. administration console
  2. admin console
  3. administrative console
- French:
  1. pupitre d’administration
  2. console d’administration
  3. pupitre admin
  4. console admin
  5. pupitre administratif
  6. console administrative

Explosion of affected compounds…
- administrative console application / administration console application
- administrative console button / administration console button
- administrative console login page / administration console login page
- core administrative console / core administration console
- ...

Fixing the problem isn’t easy…
- Change “pupitre” to “console”: the noun changes gender, so the article, adjective, participle, and pronoun must change too.
- Before: Le pupitre administratif est ouvert. Vous devez le fermer.
- After: La console administrative est ouverte. Vous devez la fermer.

Development of terminology resources is also key for language planning
- Prescriptive terminology approach, just like in enterprise environments
- The Canadian experience: Termium, the BTQ
- Other examples: Danterm, Korterm, Eurotermbank
- Termbases feed into widely distributed bulletins and other distribution media to support adoption and language reinforcement:
  - As an educational resource
  - For social and political policies
  - As an aid to commerce

Effective management of language resources requires adherence to standards and best practices

Interoperability requires adherence to standards
- Interoperability between tools and applications: CAT tools vs controlled authoring, Web interfaces, GMS, ECM, search engines…
- Interoperability between users: writers, translators, publicists
- For delivering derivative products: glossaries, Web sites, etc.
- For different purposes: learning, commercialization, government, social services, language planning, tourism, etc.
- For different media: online vs paper, hand-helds, transport interfaces, broadcasting media, marketing collateral, etc.

Standards at various levels
- Data transfer
- File format
- File structure (data model)
- Encoding
- Markup
- Syntax
- Semantics

ISO TC37 – Terminology and other language and content resources
- Standardization of principles, methods and applications relating to terminology and other language and content resources in the contexts of multilingual communication and cultural diversity.
- Web site…

TC37 Current focus areas
- Word segmentation
- Language annotations to facilitate machine processing
- Terminology policies
- Translation quality
- Simultaneous interpretation
- Data categories: www.isocat.org
- XML representation and exchange formats
- Persistent identifiers in multilingual environments

Key standards and best practices – ISO TC37
- ISO 30042 – TBX
- ISO 16642 – TMF
- ISO 12620 – new ISO TC37 Data Category Registry
- ISO Concept Database
- ISO 704 – Terminology work: Principles and methods
- ISO 12616 – Translation-oriented terminography
- ISO 26162 – Design, implementation and maintenance of terminology management systems
- Annotation schemes and frameworks (SC4)

Training professionals in language resource management – an opportunity!
- Lack of university training programs
- Lack of competency in existing fragmented university courses
- Increasing demand for qualified professionals
- For example, LISA offered 6 workshops (there was a demand for more); 73 companies attended.
- TermNet summer school: attendance grows each year

Thank you!