Localization Data

Mark Davis, PhD
Chief SW Globalization Arch., IBM
President, Unicode Consortium

International Summit on Localisation (MAIT/TDIL)
New Delhi, 2004-12-08 (R2)

Importance of Standards

Products developed in each country interoperate with other products: inside and outside that country
Mechanism for countries / industries to promulgate best practices
SW Localization
• Unicode: Universal character encoding
• CLDR: Common Locale Data Repository

Universal Character Encoding

Unicode: Unique character codes for all languages …

Common Locale Data Repository

Relatively new project: 2004
Hosted by Unicode Consortium
• http://www.unicode.org/cldr/
Goals:
• Common, required SW locale data for world languages
• XML format for effective interchange
• Freely available

What is Locale Data

Locale = identifier string referring to linguistic and cultural preferences
Typical data
• Dates/time formats
• Number/Currency formats
• Measurement System
• Collation Specification (Collation)
    Used for sorting, searching, matching
• Translated names for language, territory, script, timezones, currencies,…

Latest Release: CLDR 1.2

Released: November, 2004

Data        locales   languages   territories
Approved:   232       72          108
Draft:      63        27          28

• Unique XPaths: 2,540
• Actual Values: 56,290
• Fully Resolved: 358,860 (not including collation, aliased data)

Next Release: CLDR 1.3

Jan 2005: Freeze date
• For new enhancement requests & bug reports
Apr 2005: Target release date
Planned features
• New data / corrections / tests (ongoing)
• Survey tool
• POSIX conversion tool
• Additional Mechanisms
    lenient date/time/number parsing; different combinations of date fields; names for dialects, measurement systems; narrative reference information

Usage (direct or indirect)

Caveats
• Not a complete list: usage is not tracked, so this is an estimate
• CLDR first available in 2004, so may use precursor data
Companies / Organizations
• Adobe, Apple (Mac OS X), abas Software, Argonne National Laboratory, Ascential
  Software, Avaya, BEA, BroadJump, BluePhoenix Solutions, BMC Software (Remedy), Business Objects, caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Hyperion, IBM, Inktomi, Innodata Isogen, Informatica, Intel, Interlogics, IONA, IXOS, JD Edwards, Jikes, Macromedia, Mathworks, Mozilla, OpenOffice, Language Analysis Systems, Lawson Software, Leica Geosystems GIS & Mapping LLC, Mandrake Linux, Parrot, PayPal, Progress Software, Python, QNX, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, Teradata (NCR), Trend Micro, Virage, webMethods, Wine, WMS Gaming,…
Optional use:
• Apache, Perl, Xalan, Xerces, …

Sample: Languages, Scripts, Territories

<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>…
  <scripts>
    <script type="Arab">Arabisk</script>…
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>…

Sample: Characters / Dates

<characters>
  <exemplarCharacters>[a-z æ å ø á é í ó ú ý]</exemplarCharacters>
</characters>…
<dayContext type="format">
  <dayWidth type="abbreviated">
    <day type="sun">søn</day>
    <day type="mon">man</day>…

Sample: Timezones / Currencies

<timeZoneNames>
  <zone type="America/Los_Angeles">
    <long>
      <standard>Pacific-normaltid</standard>
      <daylight>Pacific-sommertid</daylight>
    </long>…
<currencies>
  <currency type="GAF">
    <displayName>Gabonesisk CFA-franc</displayName>
    <symbol>GAF</symbol>…

Sample: Collation

<collation type="standard" >
  <settings normalization="on" />
  <rules>
    <reset before="primary">0</reset>
    <pc>ॐ।॥॰</pc>
    <reset>ह</reset>
    <pc> ़़़़ ़़़़</pc>
    <reset>ऽ</reset>
    <p> ़</p>

Committee Process

For most effective participation from people around the world
• Meetings
    By phone, never F2F
    Short, often
    Allows preparation between meetings
• Written
    Email
    Database submissions

Vetting Process for Data

Collect from different platforms, experts, submissions: new or revised
• References to external
  sources strongly encouraged
• Must be before freeze date for release
• Will use Survey Tool
Enter in the repository
• Mark with draft attribute
• Add references, standards
Verify by CLDR committee members
• Consulting with country contacts
• If disagreement, decide in committee
Accept
• As main form: draft attribute removed
• As alternate form: marked with different attributes

Challenges

Aggressive, 6 month release schedule
Complex Formats
• Collation, Date Formats, Exemplar characters, etc.
• Require close interaction of CLDR experts with language experts
Choosing most customary, acceptable forms
• Regional differences, individual preferences
• Context (months in formats vs. calendars)
• Uncommon cases (“Interlingua”)
• Standards vs. common modern usage
• Obtaining references for data
• But can have multiple, alternate versions

Getting Involved

Simplest
• Bug report / feature request – anyone!
More Involved
• Vetting, Assessment, Tools, Policies, Decisions, …
• Any Unicode member eligible to name representatives
    Full members: IBM, Apple, Sun, Oracle, India,…
    Liaison members: Ireland, Finland, …
    Associate members: Tamil Nadu, …

Example Country Process (Finland)

Finnish Ministry of Education made CLDR data a major goal, 2004-06
• Research Institute for the Languages of Finland ("RILF" aka "Kotus") designated agency
• Documenting the national preferences in the open even more important than implementations
• Results expected to lead to new/revised national standards

Example Country Process (II)

RILF a Unicode Liaison member, 2004-07
• Set up fully open national group on language and cultural requirements on ICT, 2004-09
• Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
• Over 30 different parties represented: commercial, non-commercial, individuals
• Public comments to be allowed: http://kotoistus.fi
• Documentation for all controversial issues and deviations from any national
  standards

For more information

Unicode
• http://www.unicode.org/
CLDR
• http://www.unicode.org/cldr/
This presentation
• http://www.macchiato.com/slides/Localization.ppt
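
Appendix sketch: reading the display-name sample

The "Sample: Languages, Scripts, Territories" slide and the "Vetting Process for Data" slide can be tied together with a short script. The following is a minimal sketch, not part of the original talk and not an official CLDR tool. It assumes Python's standard xml.etree.ElementTree module, a closed-off copy of the Danish sample fragment (the slide elides the closing tags), and that an unvetted item is recognizable simply by the presence of a draft attribute. The function name display_names, the include_draft parameter, and the "zz" entry are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the Danish sample slide. Closing tags are added so that
# it parses; a real LDML file wraps this in further elements omitted here.
SAMPLE = """
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>
    <!-- Hypothetical entry, present only to illustrate the draft attribute. -->
    <language type="zz" draft="true">Eksempel</language>
  </languages>
  <scripts>
    <script type="Arab">Arabisk</script>
  </scripts>
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>
  </territories>
</localeDisplayNames>
"""


def display_names(xml_text, include_draft=False):
    """Return {group tag: {code: translated name}} for a localeDisplayNames fragment.

    Items carrying a draft attribute are skipped unless include_draft is True.
    Treating mere presence of the attribute as "unvetted" is an assumption.
    """
    root = ET.fromstring(xml_text)
    names = {}
    for group in root:  # <languages>, <scripts>, <territories>
        table = {}
        for item in group:  # the default parser drops XML comments
            if "draft" in item.attrib and not include_draft:
                continue
            # Collapse whitespace left over from line-wrapped values.
            table[item.get("type")] = " ".join((item.text or "").split())
        names[group.tag] = table
    return names


if __name__ == "__main__":
    for group, table in display_names(SAMPLE).items():
        print(group, table)
```

Calling display_names(SAMPLE, include_draft=True) also returns the hypothetical draft entry, which mirrors the distinction the vetting slide draws between data still marked as draft and data accepted as the main form.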