Language resources and their commercial applications
Kara Warburton
My aim

Demonstrate the value of language resources for commercial applications
Discuss why standards for language resources are important
Present TC37 as a standards-developing organization
Warning – slight terminology bias!
ISO/TC 37 Terminology and other language and content resources
Managing language resources

A language resource is either
  - Information expressed in a natural language
  - Information that supports the interpretation of natural language
Language resources can enhance business processes
  - If properly deployed
  - Proper deployment requires interoperability, which in turn requires standards.
Why me?

Implemented terminological resources, lexical resources, and standards for content interoperability in business environments: terminologist for IBM, LISA contributor, business consultant
Developed standards and best practices for language resources: ISO TC37, LISA
Practical experience as a technical writer and translator – using language resources in increasingly technical environments
The cold reality

The computer age has generated exponential growth of information and knowledge.
Even with the aid of computers, we can’t manage this volume of information. Why? Computers can’t understand “natural” language. They only understand “1” and “0”.
Natural language is largely unstructured; even many structured language resources are “unpredictably” structured.
This environment demands increasing volumes of structured language resources to enable next-generation computing.
Business scenarios for managing language resources

Translation memories
Terminologies and lexical resources for enhancing NLP applications
Content management and retrieval
  - Content repurposing
  - Content classification
  - Normalized language
  - Keyword management
Example: term extraction tool – use of “layered” lexical resources; grammatical rules; ranking algorithms
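The term extraction example above can be sketched roughly as follows. This is a minimal illustration, not the tool referred to in the slide: the stopword list, the "no stopword at either edge" rule (a stand-in for real grammatical rules), the two lexical layers, and the frequency-times-length score are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical function-word list; a real tool would use part-of-speech rules.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "is", "are",
             "for", "with", "that", "can", "be", "your", "when", "by", "on"}


def candidate_terms(text, max_len=3):
    """Count n-grams (1..max_len words) that do not cross sentence boundaries."""
    counts = Counter()
    for sentence in re.split(r"[.!?]", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Stand-in grammatical rule: no stopword at either edge.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


def rank_terms(text, known_terms, general_words, top=5):
    """Filter candidates through two lexical layers, then rank the rest."""
    scored = {}
    for candidate, freq in candidate_terms(text).items():
        if candidate in known_terms:      # layer 1: already in the termbase
            continue
        if candidate in general_words:    # layer 2: general-language lexicon
            continue
        # Assumed ranking: frequency weighted by number of words.
        scored[candidate] = freq * len(candidate.split())
    return sorted(scored, key=lambda c: (-scored[c], c))[:top]


sample = ("Open the administrative console. "
          "The administrative console lists each configuration item. "
          "Close the administrative console.")
ranked = rank_terms(sample,
                    known_terms={"configuration item"},
                    general_words={"open", "close", "lists", "each"})
print(ranked)
```

Each layer removes candidates that are already known, so only genuinely new term candidates reach the ranking step.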
Managing terminology supports both social and commercial interests

Economic/commercial:
  - Control terminology to ensure quality and minimize production costs.
  - Build terminological resources that are repurposable across the content management chain.
  - Increase competitiveness in local and global markets.
Social/geopolitical:
  - Strengthen and protect minority languages.
  - Support cultural diversity.
  - Increase global presence and visibility.
Is “managed” terminology really important for a business?

In the automotive industry, almost 50% of translation errors are “wrong term” (Woyde)
40% of time required for text production is terminology work (Stellbrink)
Between 30% and 70% of errors in technical documentation are terminology errors (Schutz, and MULTIDOC)
Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)
Return on investment: 10% ($100 investment yields $110 return) (Champagne)
Outsourced translations may be 50% more expensive if source terminology is inconsistent (Kjeldgaard)
Need more proof?

Terminology tools increase productivity by approx. 20% (Champagne)
Without a central reference, each needless search can take 20 to 30 minutes (Champagne)
It costs 10 times more to fix a term at the end of the production cycle than at the beginning (Xerox, JDEdwards)
Inconsistent or inaccurate terminology raises service costs
Terminology mistakes can lead to lawsuits for copyright or trademark infringement, or for damages due to defective products or incorrect user documentation.
IBM scenario…

“Terminology work is necessary for between 4% and 6% of all words in a text (Champagne)”

429 million words are translated per year at IBM. At the midpoint of that range (5%), over 21 million words require attention.
In 2009, IBM “processed” over 160,000 terms as part of the “content conveyor belt”, in nearly 3,000 specialized “dictionaries”.
Very small staff
High degree of automation
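The arithmetic behind the IBM figure can be checked with a few lines. The 429 million word volume and the 4% to 6% range come from the slides; treating 5% as the midpoint behind "over 21 million" is an assumption of this sketch.

```python
# Annual translation volume at IBM, as quoted on the slide.
WORDS_PER_YEAR = 429_000_000


def words_needing_terminology_work(total_words, percent):
    """Integer arithmetic avoids floating-point rounding noise."""
    return total_words * percent // 100


low = words_needing_terminology_work(WORDS_PER_YEAR, 4)
mid = words_needing_terminology_work(WORDS_PER_YEAR, 5)   # assumed midpoint
high = words_needing_terminology_work(WORDS_PER_YEAR, 6)

print(f"4%: {low:,} words")
print(f"5%: {mid:,} words")
print(f"6%: {high:,} words")
```

The full range is therefore roughly 17 to 26 million words per year, with the midpoint matching the "over 21 million" quoted above.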
What “measures” need to be taken?

Deploy a terminology database that serves multiple purposes
Integrate the database into all content environments to ensure a “push” mechanism
Respect data management principles, such as data granularity, elementarity, etc.
Adopt best practices for terminology, such as term autonomy and concept orientation
Allow for extensibility for features such as morphology as needed for future applications
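A minimal sketch of what concept orientation, term autonomy and extensibility could look like in a data model is given below. The class names, field names, status values and the choice of preferred term are all hypothetical; they illustrate the principles and do not reproduce any particular termbase schema. The same entries serve two purposes here: looking up the preferred term and expanding a search query.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Term autonomy: every term carries its own full set of fields."""
    text: str
    language: str
    status: str = "admitted"          # assumed values: preferred/admitted/deprecated
    extra: dict = field(default_factory=dict)  # extensibility, e.g. morphology


@dataclass
class Concept:
    """Concept orientation: one entry per concept, holding all its terms."""
    concept_id: str
    definition: str
    terms: list = field(default_factory=list)

    def preferred(self, language):
        for term in self.terms:
            if term.language == language and term.status == "preferred":
                return term
        return None


class Termbase:
    def __init__(self, concepts):
        self.concepts = list(concepts)

    def find(self, text, language):
        wanted = text.lower()
        return [c for c in self.concepts
                if any(t.language == language and t.text.lower() == wanted
                       for t in c.terms)]

    def expand_query(self, text, language):
        """Return every same-language variant of the concept(s) matching text."""
        variants = []
        for concept in self.find(text, language):
            for term in concept.terms:
                if term.language == language and term.text not in variants:
                    variants.append(term.text)
        return variants or [text]


console = Concept(
    concept_id="c1",
    definition="Interface used to administer a system.",  # illustrative only
    terms=[
        Term("administrative console", "en", "preferred"),
        Term("administration console", "en", "deprecated"),
        Term("admin console", "en", "deprecated"),
        Term("console d'administration", "fr", "preferred",
             extra={"gender": "feminine"}),
    ],
)
termbase = Termbase([console])
expanded = termbase.expand_query("admin console", "en")
print(expanded)
print(console.preferred("fr").text)
```

Because variants hang off one concept rather than living in separate entries, an authoring checker, a translation tool and a search engine can all read the same record.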
Basic example – repurpose information

<h1>CI revision conflicts</h1>
<p>When revising your <term keyref="ci">CI</term>, to avoid conflicts...</p>

<glossentry id="ci">
  <glossterm>configuration item</glossterm>
  <glossdef>An entity in a configuration that satisfies an end-use function and can be uniquely identified.</glossdef>
  <glossBody>
    <glossAlt>
      <glossAcronym>CI</glossAcronym>
    </glossAlt>
  </glossBody>
</glossentry>
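One way to see the repurposing at work is to resolve the `keyref` against the glossary entry and expand the acronym on first use. The sketch below is a simplification under stated assumptions: the `<topic>` and `<glossary>` wrapper elements are added only so each fragment parses as a single document, and the key is matched directly against the entry `id`, whereas a real DITA toolchain resolves keys through a map.

```python
import xml.etree.ElementTree as ET

# Wrapper elements are hypothetical; the content is the slide's example.
TOPIC = """<topic>
  <h1>CI revision conflicts</h1>
  <p>When revising your <term keyref="ci">CI</term>, to avoid conflicts...</p>
</topic>"""

GLOSSARY = """<glossary>
  <glossentry id="ci">
    <glossterm>configuration item</glossterm>
    <glossdef>An entity in a configuration that satisfies an end-use function and can be uniquely identified.</glossdef>
    <glossBody>
      <glossAlt>
        <glossAcronym>CI</glossAcronym>
      </glossAlt>
    </glossBody>
  </glossentry>
</glossary>"""


def load_glossary(xml_text):
    """Index glossary entries by id."""
    entries = {}
    for entry in ET.fromstring(xml_text).iter("glossentry"):
        entries[entry.get("id")] = {
            "term": entry.findtext("glossterm"),
            "definition": entry.findtext("glossdef"),
            "acronyms": [a.text for a in entry.iter("glossAcronym")],
        }
    return entries


def expand_terms(topic_xml, glossary):
    """Rewrite each <term keyref="..."> as 'full form (short form)'."""
    root = ET.fromstring(topic_xml)
    for term in root.iter("term"):
        entry = glossary.get(term.get("keyref"))
        if entry:
            term.text = f"{entry['term']} ({term.text})"
    return root


glossary = load_glossary(GLOSSARY)
expanded_topic = expand_terms(TOPIC, glossary)
paragraph = "".join(expanded_topic.find("p").itertext())
print(paragraph)
```

The same glossary entry could equally feed a hover definition, a generated glossary page or a translator's term list, which is the point of storing it once.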
Controlled authoring
Controlled translation
Search – Query expansion
Source data…
Search – Query correction
Synonyms/inconsistencies multiply in the target language – this is bad for business

automatic memory reclamation
  - remise en état automatique du mémoire
  - récupération automatique de mémoire
automatic storage reclamation
  - remise en état automatique de l’archivage
  - remise en état automatique du stockage
  - récupération automatique de l’archivage
  - récupération automatique du stockage
garbage collection
  - récupération de place
  - vidage de la corbeille
  - récupération de place en mémoire
  - récupération de positions inutilisées
  - récupération de l’espace mémoire
“Cosmetic” differences can become more than cosmetic in the target language

1. administration console
2. admin console
3. administrative console

1. pupitre d’administration
2. console d’administration
3. pupitre admin
4. console admin
5. pupitre administratif
6. console administrative
Explosion of affected compounds…

administrative console application / administration console application
administrative console button / administration console button
administrative console login page / administration console login page
core administrative console / core administration console
....
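Fixing the problem isn’t easy…

Change “pupitre” to “console”…
“Pupitre” is masculine and “console” is feminine, so the article, adjective, participle and pronoun must all change with it:
Le pupitre administratif est ouvert. Vous devez le fermer.
La console administrative est ouverte. Vous devez la fermer.
(“The administrative console is open. You must close it.”)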
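The difficulty above can be made concrete with a naive search-and-replace. The sketch below only demonstrates the failure; it is not a proposed fix, and the word-by-word comparison is a simple device chosen for this illustration.

```python
before = "Le pupitre administratif est ouvert. Vous devez le fermer."
expected = "La console administrative est ouverte. Vous devez la fermer."

# Naive global replacement of the noun only.
naive = before.replace("pupitre", "console")

# Compare word by word to list what the naive replacement left wrong.
still_wrong = [(got, want)
               for got, want in zip(naive.split(), expected.split())
               if got != want]

print(naive)
for got, want in still_wrong:
    print(f"  {got!r} should be {want!r}")
```

Four words besides the noun still need correcting, and two of them sit in a different sentence from the replaced term, which is why such fixes usually require human review.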
Development of terminology resources is also key for language planning

Prescriptive terminology approach – just like in enterprise environments
The Canadian experience: Termium, the BTQ
Other examples: Danterm, Korterm, Eurotermbank
Termbases feed into widely-distributed bulletins and other distribution media to support adoption and language reinforcement
  - As an educational resource
  - For social and political policies
  - As an aid to commerce
Effective management of language resources requires adherence to standards and best practices
Interoperability requires adherence to standards

Interoperability between tools and applications: CAT tools vs controlled authoring, Web interfaces, GMS, ECM, search engines…
Interoperability between users – writers, translators, publicists
For delivering derivative products – glossaries, Web sites, etc.
For different purposes – learning, commercialization, government, social services, language planning, tourism, etc.
For different media – online vs paper, hand-helds, transport interfaces, broadcasting media, marketing collateral, etc.
Standards at various levels

Data transfer
File format
File structure (data model)
Encoding
Markup
Syntax
Semantics
ISO TC37 – Terminology and other language and content resources

Standardization of principles, methods and applications relating to terminology and other language and content resources in the contexts of multilingual communication and cultural diversity.
Web site…
TC37 Current focus areas

Word segmentation
Language annotations to facilitate machine processing
Terminology policies
Translation quality
Simultaneous interpretation
Data categories - www.isocat.org
XML representation and exchange formats
Persistent identifiers in multilingual environments
Key standards and best practices – ISO TC37

ISO 30042 – TBX (TermBase eXchange)
ISO 16642 – TMF (Terminological Markup Framework)
ISO 12620 – new ISO TC37 Data Category Registry
ISO Concept Database
ISO 704 – Terminology work: Principles and methods
ISO 12616 – Translation-oriented terminography
ISO 26162 – Design, implementation and maintenance of terminology management systems
Annotation schemes and frameworks (SC4)
Training professionals in language resource management – an opportunity!

Lack of university training programs
Lack of competency in existing fragmented university courses
Increasing demand for qualified professionals
For example,
  - LISA offered 6 workshops (there was a demand for more); 73 companies attended.
  - TermNet summer school – attendance grows each year
Thank you!