Localization Data

Mark Davis, PhD
Chief Software Globalization Architect, IBM
President, Unicode Consortium
International Summit on Localisation (MAIT/TDIL)
New Delhi, 2004-12-08 (R2)
Importance of Standards



Products developed in each country
interoperate with other products:
inside and outside that country
Mechanism for countries / industries
to promulgate best practices
SW Localization
• Unicode: Universal character encoding
• CLDR: Common Locale Data Repository
Universal Character Encoding

Unicode: Unique character codes for
all languages
…
Common Locale Data Repository

Relatively new project: 2004

Hosted by Unicode Consortium
• http://www.unicode.org/cldr/

Goals:
• Common, required SW locale data for
world languages
• XML format for effective interchange
• Freely available
What is Locale Data


Locale = identifier string referring to
linguistic and cultural preferences
Typical data
• Dates/time formats
• Number/Currency formats
• Measurement System
• Collation Specification (Collation)
  Used for sorting, searching, matching
• Translated names for language, territory,
script, timezones, currencies,…
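As a rough illustration of a locale being an identifier string, the sketch below splits an identifier such as da_DK into its parts. The function name parse_locale_id and the simplified language[_Script][_TERRITORY] shape are assumptions made for this example; the real identifier grammar allows more (variants, for instance).

```python
import re

# Assumed, simplified shape: language, optional 4-letter script,
# optional 2-letter territory, separated by "_" or "-".
_LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:[_-](?P<script>[A-Za-z]{4}))?"
    r"(?:[_-](?P<territory>[A-Za-z]{2}))?$"
)


def parse_locale_id(identifier):
    """Split a locale identifier into language, script and territory."""
    match = _LOCALE_RE.match(identifier)
    if match is None:
        raise ValueError(f"unsupported locale identifier: {identifier!r}")
    parts = match.groupdict()
    # Normalise casing: script is title case, territory is upper case.
    if parts["script"]:
        parts["script"] = parts["script"].title()
    if parts["territory"]:
        parts["territory"] = parts["territory"].upper()
    return parts


danish_denmark = parse_locale_id("da_DK")
# danish_denmark == {"language": "da", "script": None, "territory": "DK"}
```

Software would then use these parts as the key for looking up the kinds of data listed above.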
Latest Release: CLDR 1.2

Released: November, 2004

Data
            locales   languages   territories
Approved:   232       72          108
Draft:      63        27          28

• Unique XPaths: 2,540
• Actual Values: 56,290
• Fully Resolved: 358,860
(not including collation, aliased data)
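The gap between "actual values" and "fully resolved" values reflects locale inheritance: a locale stores only what differs from its parent, and the rest is filled in by falling back (for example da_DK, then da, then root). The sketch below shows that idea on toy data. The dictionary layout, the key names and the root values are hypothetical; only the Danish day abbreviations come from the samples in this presentation.

```python
# Hypothetical toy repository: each locale stores only its own values.
TOY_DATA = {
    "root": {"day.sun": "1", "day.mon": "2", "measurement": "metric"},
    "da": {"day.sun": "søn", "day.mon": "man"},
    "da_DK": {},
}


def parent_chain(locale_id):
    """Yield locale_id, then each truncated parent, ending with 'root'."""
    current = "" if locale_id == "root" else locale_id
    while current:
        yield current
        # Drop the last "_"-separated part: "da_DK" -> "da" -> "".
        current = current.rpartition("_")[0]
    yield "root"


def resolve(data, locale_id):
    """Merge values from root down to locale_id; children override parents."""
    resolved = {}
    for loc in reversed(list(parent_chain(locale_id))):
        resolved.update(data.get(loc, {}))
    return resolved


stored = sum(len(values) for values in TOY_DATA.values())
fully_resolved = sum(len(resolve(TOY_DATA, loc)) for loc in TOY_DATA)
# stored == 5, fully_resolved == 9 for this toy data
```

On the slide's figures the same effect is much larger: 56,290 stored values expand to 358,860 resolved ones.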
Next Release: CLDR 1.3

Jan 2005: Freeze date
• For new enhancement requests & bug reports

Apr 2005: Target release date

Planned features
• New data / corrections / tests (ongoing)
• Survey tool
• POSIX conversion tool
• Additional Mechanisms
  lenient date/time/number parsing;
  different combinations of date fields;
  names for dialects, measurement systems;
  narrative reference information
Usage (direct or indirect)

Caveats
• Not a complete list: usage is not tracked, so this is an
estimate
• CLDR first became available in 2004, so some of these may use precursor data

Companies / Organizations
• Adobe, Apple (Mac OS X), abas Software, Argonne National
Laboratory, Ascential Software, Avaya, BEA, BroadJump,
BluePhoenix Solutions, BMC Software (Remedy), Business
Objects, caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP,
Hyperion, IBM, Inktomi, Innodata Isogen, Informatica, Intel,
Interlogics, IONA, IXOS, JD Edwards, Jikes, Macromedia,
Mathworks, Mozilla, OpenOffice, Language Analysis Systems,
Lawson Software, Leica Geosystems GIS & Mapping LLC,
Mandrake Linux, Parrot, PayPal, Progress Software, Python,
QNX, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun
Microsystems (Solaris, Java), SuSE, Sybase, Teradata (NCR),
Trend Micro, Virage, webMethods, Wine, WMS Gaming,…

Optional use:
• Apache, Perl, Xalan, Xerces, …
Sample: Languages, Scripts, Territories
<localeDisplayNames>
<languages>
<language type="aa">Afar</language>
<language type="ab">Abkhasisk</language>…
<scripts>
<script type="Arab">Arabisk</script>…
<territories>
<territory type="AD">Andorra</territory>
<territory type="AE">Forenede Arabiske Emirater</territory>…
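To show how software might read display names like these, here is a minimal sketch using Python's standard xml.etree.ElementTree. The SAMPLE string is a trimmed, well-formed fragment containing only the entries shown on this slide (closing tags added, elided entries left out), and the helper name display_names is an assumption for this example.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment built only from the entries shown on the slide.
SAMPLE = """\
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>
  </languages>
  <scripts>
    <script type="Arab">Arabisk</script>
  </scripts>
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>
  </territories>
</localeDisplayNames>
"""


def display_names(xml_text, group, item):
    """Map each code (the type attribute) to its translated name."""
    root = ET.fromstring(xml_text)
    return {
        element.get("type"): " ".join((element.text or "").split())
        for element in root.findall(f"./{group}/{item}")
    }


territories = display_names(SAMPLE, "territories", "territory")
# territories["AE"] == "Forenede Arabiske Emirater"
```

The whitespace normalisation is there because names in the XML may be wrapped across lines.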
Sample: Characters / Dates
<characters>
<exemplarCharacters>[a-z æ å ø á é í ó ú ý]</exemplarCharacters>
</characters>…
<dayContext type="format">
<dayWidth type="abbreviated">
<day type="sun">søn</day>
<day type="mon">man</day>…
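One possible use of the exemplar character set above is checking whether a piece of text stays within the letters customary for the language. The sketch below expands the bracketed pattern from the sample and reports letters outside it. It handles only single characters and simple a-z style ranges, which is an assumption that covers this sample but not the full set syntax; the function names are hypothetical.

```python
def expand_exemplars(pattern):
    """Expand a simple pattern like "[a-z æ å]" into a set of characters."""
    body = pattern.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("pattern must be enclosed in square brackets")
    # Spaces inside the brackets only separate items.
    chars = [c for c in body[1:-1] if not c.isspace()]
    result = set()
    i = 0
    while i < len(chars):
        if i + 2 < len(chars) and chars[i + 1] == "-":
            # Range such as a-z: include every code point in between.
            start, end = ord(chars[i]), ord(chars[i + 2])
            result.update(chr(cp) for cp in range(start, end + 1))
            i += 3
        else:
            result.add(chars[i])
            i += 1
    return result


def letters_outside(text, exemplars):
    """Return the letters of text (lower-cased) not in the exemplar set."""
    return {c for c in text.lower() if c.isalpha() and c not in exemplars}


DANISH = expand_exemplars("[a-z æ å ø á é í ó ú ý]")
# letters_outside("søndag", DANISH) == set()
# letters_outside("Zürich", DANISH) == {"ü"}
```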
Sample: Timezones / Currencies
<timeZoneNames>
<zone type="America/Los_Angeles">
<long>
<standard>Pacific-normaltid</standard>
<daylight>Pacific-sommertid</daylight>
</long>…
<currencies>
<currency type="GAF">
<displayName>Gabonesisk CFA-franc</displayName>
<symbol>GAF</symbol>…
Sample: Collation
<collation type="standard" >
<settings normalization="on" />
<rules>
<reset before="primary">0</reset>
<pc>ॐ।॥॰</pc>
<reset>ह</reset>
<pc> ़़़़ ़़़़</pc>
<reset>ऽ</reset>
<p> ़</p>
Committee Process

For most effective participation from
people around the world
• Meetings

By phone, never face to face

Short, often

Allows preparation between meetings
• Written

Email

Database submissions
Vetting Process for Data

Collect from different platforms, experts, submissions: new or
revised
• References to external sources strongly encouraged
• Must be before freeze date for release
• Will use Survey Tool

Enter in the repository
• Mark with draft attribute
• Add references, standards

Verify by CLDR committee members
• Consulting with country contacts
• If disagreement, decide in committee

Accept
• As main form: draft attribute removed
• As alternate form: marked with different attributes
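The enter / verify / accept steps above can be sketched as operations on XML elements. This is only an illustration under stated assumptions: the draft attribute is named on the slide, while the references and alt attribute names, the "variant" label and all function names are assumptions for this example, and the committee verification step is left out because it is a human decision.

```python
import xml.etree.ElementTree as ET


def submit(parent, tag, type_code, text, references=None):
    """Enter a new value in the repository, marked as draft."""
    attrs = {"type": type_code, "draft": "true"}
    if references:
        # Assumed attribute name for pointing at external sources.
        attrs["references"] = references
    element = ET.SubElement(parent, tag, attrs)
    element.text = text
    return element


def accept_as_main(element):
    """Accept as the main form: the draft attribute is removed."""
    element.attrib.pop("draft", None)


def accept_as_alternate(element, label="variant"):
    """Accept as an alternate form: marked with a different attribute."""
    element.attrib.pop("draft", None)
    element.set("alt", label)  # assumed attribute name and value


days = ET.Element("dayWidth", {"type": "abbreviated"})
sunday = submit(days, "day", "sun", "søn")
accept_as_main(sunday)
# ET.tostring(days, encoding="unicode") now shows <day type="sun">søn</day>
```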
Challenges

Aggressive, 6 month release schedule

Complex Formats
• Collation, Date Formats, Exemplar characters, etc.
• Require close interaction of CLDR experts with language
experts

Choosing most customary, acceptable forms
• Regional differences, individual preferences
• Context (months in formats vs. calendars)
• Uncommon cases (“Interlingua”)
• Standards vs. common modern usage
• Obtaining references for data
• But can have multiple, alternate versions
Getting Involved

Simplest
• Bug report / feature request – anyone!

More Involved
• Vetting, Assessment, Tools, Policies, Decisions,
…
• Any Unicode member eligible to name
representatives

Full members: IBM, Apple, Sun, Oracle, India,…

Liaison members: Ireland, Finland, …

Associate members: Tamil Nadu, …
Example Country Process (Finland)

Finnish Ministry of Education made
CLDR data a major goal, 2004-06
• Research Institute for the Languages of
Finland ("RILF" aka "Kotus") designated
agency
• Documenting the national preferences in
the open even more important than
implementations
• Results expected to lead to new/revised
national standards
Example Country Process (II)

RILF a Unicode Liaison member, 2004-07
• Set up fully open national group on language and
cultural requirements on ICT, 2004-09
• Two official languages (Finnish and Swedish) &
four regional / minority languages (three Sámi &
Romani as spoken in Finland) to be covered
• Over 30 different parties represented:
commercial, non-commercial, individuals
• Public comments to be allowed:
http://kotoistus.fi
• Documentation for all controversial issues and
deviations from any national standards
For more information

Unicode
• http://www.unicode.org/

CLDR
• http://www.unicode.org/cldr/

This presentation
• http://www.macchiato.com/slides/Localization.ppt