ICU Overview: The Open-Source Unicode Library, v3.2
Markus Scherer
ICU Manager
IBM Globalization Center of Competency
27th Internationalization and Unicode Conference
Berlin, Germany, April 2005
Agenda
• Background
• What is ICU?
• Architecture Overview
• ICU Features and recent additions
• References
• Q and A
Why Globalization?
Unicode
• All world languages
• Efficient and effective processing
• Lossless data exchange
• Enables single-binary global software
• But… all languages ⇒ large, complex standard
  – 1,400 pages + Annexes + additional standards
  – 90,000+ characters
  – Major update every 3 years
  – 70 character properties, many multi-valued
  – Affects many processes: display, line-break, regex, …
Locales
• Features vary widely across languages & countries
  – Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
  – Performance is key: easy to do the right thing; hard to do it fast
What is ICU?
• Globalization / Unicode / Locales
• Mature, widely used set of C/C++ and Java libraries
  – Basis for Java 1.1 internationalization – but goes far beyond
  – “ICU4C”: C/C++ libraries; “ICU4J”: Java library
• Very portable – identical results on all platforms / programming languages
  – C/C++: 30+ platforms/compilers
  – Java: IBM & Sun JDK
• Full threading model; customizable; modular
• Open source – but not viral
• ICU 3.2: 78 languages; 118 countries; 870 codepages
Who uses ICU? (Examples)
• Products Within IBM
  – DB2, COBOL, InfoPrint Manager, Lotus Notes, Lotus Workplace, Tivoli Presentation Services, WebSphere, XML Parser, …
• Other Companies and Organizations
  – Adobe, Apple (Mac OS X), BEA, CERN, Cognos, Debian, HP, Inktomi, JD Edwards, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, PayPal, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, webMethods, …
ICU Features
• Unicode text handling
• Charset conversions (870+)
• Collation & Searching
• Locales (170+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Unicode Regular Expressions
• Breaks: word, line, …
• Formatting
  – Date & time
  – Messages
  – Numbers & currencies
• Transforms
  – Normalization
  – Casing
  – Transliterations
Architecture Overview 1
• Locale Based Services
  – Locale is an identifier, not a container
  – Keywords for variants: de@collation=phonebook
  – Recent addition: accept-language support
• Resource inheritance: shared resources
[Figure: resource inheritance tree. root → Language (en, de, zh) → Script (Hant, Hans) → Country (US, IE, DE, CH, TW, CN)]
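The idea of a locale as a plain identifier with keywords, resolved through shared parent resources, can be sketched in a few lines. This is only an illustration: `LocaleIdSketch` and its methods are hypothetical helpers, not ICU classes, and the fallback order (most specific ID first, `root` last) is an assumption based on the inheritance levels named on this slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an ICU class: treats a locale ID as a plain string identifier.
final class LocaleIdSketch {
    // Returns the ID without its "@key=value;..." keywords, e.g. "de".
    static String baseName(String id) {
        int at = id.indexOf('@');
        return at < 0 ? id : id.substring(0, at);
    }

    // Parses the keywords after '@' into an ordered map, e.g. {collation=phonebook}.
    static Map<String, String> keywords(String id) {
        Map<String, String> result = new LinkedHashMap<>();
        int at = id.indexOf('@');
        if (at < 0) return result;
        for (String pair : id.substring(at + 1).split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) result.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return result;
    }

    // Assumed lookup order, most specific first: zh_Hant_TW, zh_Hant, zh, root.
    static List<String> fallbackChain(String id) {
        List<String> chain = new ArrayList<>();
        String current = baseName(id);
        while (!current.isEmpty()) {
            chain.add(current);
            int cut = current.lastIndexOf('_');
            current = cut < 0 ? "" : current.substring(0, cut);
        }
        if (!chain.contains("root")) chain.add("root");
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(keywords("de@collation=phonebook"));
        System.out.println(fallbackChain("zh_Hant_TW"));
    }
}
```

A resource missing at one level would be looked up at the next entry in the chain, which is what lets sibling locales share data.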
Architecture Overview 2
• Open and Close Service Model
  – Open a service object, use it many times, close it when done
  – Better performance by avoiding setup costs per operation
  – Warning: use properly for maximum performance
• ICU Threading Model
  – Multiple service objects in use simultaneously, with same or different attributes
  – Large resources shared in read-only cache
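A minimal sketch of the open-once, use-many-times pattern follows. It uses the JDK's `java.text.NumberFormat` as a stand-in for an ICU service object; the class name `ReuseFormatterSketch` is made up for this example. In ICU4C the same pattern appears as paired calls such as `ucol_open` and `ucol_close`.

```java
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a JDK formatter plays the role of the service object.
final class ReuseFormatterSketch {
    // Creates the formatter once, then reuses it for every value.
    static List<String> formatAll(Locale locale, double... values) {
        NumberFormat format = NumberFormat.getInstance(locale); // "open": setup cost paid once
        List<String> out = new ArrayList<>();
        for (double v : values) {
            out.add(format.format(v)); // "use": no per-call setup
        }
        return out; // no explicit "close" in Java; the object is garbage-collected
    }

    public static void main(String[] args) {
        System.out.println(formatAll(Locale.US, 1234.5, 0.25));
        System.out.println(formatAll(Locale.GERMANY, 1234.5, 0.25));
    }
}
```

Creating the formatter inside the loop would give the same output but repeat the locale-data lookup for every value, which is the cost the slide warns about.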
Architecture Overview 3
• Data Driven Services
  – Customize at build-time or run-time
  – Interchange with other platforms;
    • same results on each
  – Rule-based
    • Collation, Word-breaks, Transforms
  – Pattern-based
    • Formats, UnicodeSet
  – Table-based
    • Character Conversion
Architecture Overview – ICU4C
• Simple Error Handling
  – C++ subset for portability
  – Support for multi-threaded environment
• Version Management
  – Multiple versions at the same time
  – Data and library versioning
• String Buffer Management
  – Preflighting and overflow protection
• Misc: Load/Unload ICU
• Recent Additions:
  – Runtime-settable memory allocation and mutex functions
Architecture Overview – ICU4J
• Supplement for Java
• Core globalization (no character conversion or regular expressions, no GUI components)
  – We do supply complex text support for Sun
• Modularized: products may add just needed functionality
ICU4J vs. JDK
• CLDR 1.2 (Common Locale Data Repository)
• Up-to-date globalization: standards-compliant; latest Unicode
  – Supplementary characters (GB 18030, JIS X 0213, HKSCS)
    • Java 5 adds handling of supplementary characters
  – Full properties – JDK has only a fraction
  – Unicode Collation Algorithm
  – Local calendars (Thailand, Japan, …); ISO dates
  – Currencies, String Search, Int’l Domain Names
  – Transforms: Case, Scripts, Normalization
• Much faster turn-around on bug fixes, enhancements
Unicode Text Handling
• C
  – UChar*: null-terminated or with length
• C++
  – UnicodeString: full featured string class
• Java
  – Uses normal JDK String, adds utilities
• All handle supplementary characters
  – Required for GB 18030/JIS X 0213/HKSCS repertoires
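To show what handling supplementary characters means for a normal JDK String, here is a small sketch using only `java.lang` methods available since Java 5. The class and method names are invented for this example; ICU4J's own utilities are not used.

```java
// Illustrative only: code-point-aware operations on a UTF-16 Java String.
final class SupplementarySketch {
    // Counts code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Reverses by code point so that surrogate pairs stay intact.
    static String reverse(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            sb.appendCodePoint(cp);
            i -= Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD840\uDC00"; // 'a' followed by U+20000, a CJK Extension B ideograph
        System.out.println(s.length());    // 3 UTF-16 code units
        System.out.println(codePoints(s)); // 2 code points
    }
}
```

Code that indexes by `char` would split U+20000 into two unpaired surrogates; iterating by code point avoids that.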
Unicode Text Handling 2
• All Unicode 4.0.1 properties
  – Direct API
    • Values, names, enumerations
  – UnicodeSet
    • Fast, compact set operations
    • Pattern-based (both Perl & POSIX syntax for properties)
      – \p{greek} vs. [:greek:]
    • All properties:
      – [\p{lowercase}-[a-z]]
      – [\p{greek} & \p{uppercase}]
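The two set expressions above (a difference and an intersection) can be approximated with JDK regular-expression character classes. This is a sketch under assumptions: it uses `java.util.regex` syntax, which writes intersection as `&&` and expresses difference as intersection with a negated class, so the patterns differ from the UnicodeSet syntax on the slide. `PropertySetSketch` is a made-up name.

```java
import java.util.regex.Pattern;

// Illustrative only: JDK regex character classes standing in for UnicodeSet patterns.
final class PropertySetSketch {
    // Set difference: lowercase characters except ASCII a-z.
    static final Pattern LOWER_NOT_ASCII = Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");
    // Set intersection: Greek script and uppercase.
    static final Pattern GREEK_UPPER = Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    // True if the whole input (expected to be one character) is in the set.
    static boolean contains(Pattern set, String oneChar) {
        return set.matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(GREEK_UPPER, "\u03A9"));     // GREEK CAPITAL LETTER OMEGA
        System.out.println(contains(LOWER_NOT_ASCII, "\u00E9")); // LATIN SMALL LETTER E WITH ACUTE
    }
}
```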
Data: Recent Additions
• Conforms to CLDR 1.2
  – 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones.
  – Added data for new languages: Malayalam, Oriya, Welsh
• Reduced multiplatform install image size
• Improved XLIFF-ICU conversion tools
• Locale canonicalization spec defined and implemented (C+J)
  – Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support
Character Set Conversion
• Precise alias information:
  – When you ask for “SJIS”, you can request the precise definition by platform:
    • Windows, IBM, Solaris, …
• Buffer management
  – automatically handles characters that cross buffers
• Customizations allowed for:
  – illegal sequences
  – undefined characters
• Unicode Text Compression – SCSU, BOCU-1
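As a rough illustration of customizing what happens on illegal sequences, the sketch below uses the JDK's `java.nio.charset` decoder, since ICU4J itself does not ship character conversion. The class name and the choice of UTF-8 are assumptions for the example; ICU4C exposes the same idea through converter callbacks, which are not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: JDK decoder with a configurable action for bad input.
final class DecodeErrorSketch {
    // Decodes UTF-8, applying the given action to illegal or unmappable input.
    static String decode(byte[] bytes, CodingErrorAction onError) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(onError)
                .onUnmappableCharacter(onError);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Only reached when the action is REPORT.
            throw new IllegalArgumentException("illegal byte sequence", e);
        }
    }

    // True if the bytes decode without any error under the strict REPORT action.
    static boolean isWellFormed(byte[] bytes) {
        try {
            decode(bytes, CodingErrorAction.REPORT);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x41, (byte) 0xFF, 0x42}; // 0xFF never occurs in UTF-8
        System.out.println(decode(bad, CodingErrorAction.IGNORE));
        System.out.println(isWellFormed(bad));
    }
}
```

With `REPLACE` the bad byte becomes U+FFFD, with `IGNORE` it is dropped, and with `REPORT` decoding stops with an error.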
Collation and Searching
• Fast international comparison and string search; fully UCA compliant
  – Compressed sort keys, optimized string comparison, sublinear string search
  – incremental sortkeys for radix-sort
• Precise binary sortkey stability over time
• Fully data driven
• API / rule customizations
  – strength, normalization, upper vs. lowercase first, ignore punctuation, sort digits as numbers, …
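The strength and normalization customizations can be illustrated with the JDK's `java.text.Collator`, whose API ICU4J's collator closely resembles. This is a sketch, not ICU code: the JDK collator is not fully UCA compliant, and `CollationStrengthSketch` is a name chosen for this example.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative only: JDK Collator showing the effect of the strength attribute.
final class CollationStrengthSketch {
    // Compares two strings at the given strength, with canonical decomposition enabled.
    static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION); // normalization on
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY ignores accents and case; SECONDARY sees accents; TERTIARY also sees case.
        System.out.println(compare("resume", "r\u00E9sum\u00E9", Collator.PRIMARY));
        System.out.println(compare("resume", "RESUME", Collator.SECONDARY));
        System.out.println(compare("resume", "RESUME", Collator.TERTIARY));
    }
}
```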
Collation and Searching: Recent Additions
• Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
  – e.g., filenames would sort "ab-2" < "ab-10"
  – without material performance cost
  – with reduced sortkey length.
• Significantly improved sorting orders for many other languages
• Data in separate tree, for easier modularization and maintenance
• getFunctionalEquivalent API allows for better caching and UI support.
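The effect of numeric sorting can be sketched with a hand-written comparator. This is not how ICU implements it (ICU works on collation elements and sort keys); it is a simplified stand-in that handles ASCII digits only and compares all other characters by code unit. `NumericOrderSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: compares runs of ASCII digits by numeric value.
final class NumericOrderSketch {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the index just past the digit run starting at 'from'.
    static int digitsEnd(String s, int from) {
        int i = from;
        while (i < s.length() && isDigit(s.charAt(i))) i++;
        return i;
    }

    // Skips leading zeros but always leaves at least one digit.
    static int skipZeros(String s, int from, int end) {
        int i = from;
        while (i < end - 1 && s.charAt(i) == '0') i++;
        return i;
    }

    static int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                int ei = digitsEnd(a, i), ej = digitsEnd(b, j);
                int si = skipZeros(a, i, ei), sj = skipZeros(b, j, ej);
                // A longer run (without leading zeros) is a larger number.
                if (ei - si != ej - sj) return (ei - si) - (ej - sj);
                int cmp = a.substring(si, ei).compareTo(b.substring(sj, ej));
                if (cmp != 0) return cmp;
                i = ei;
                j = ej;
            } else {
                if (ca != cb) return ca - cb;
                i++;
                j++;
            }
        }
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(List.of("ab-10", "ab-2", "ab-1"));
        names.sort(NumericOrderSketch::compare);
        System.out.println(names);
    }
}
```

Plain `String.compareTo` would place "ab-10" before "ab-2" because it compares '1' with '2' character by character.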
Calendar & Time Zones
• International Calendars – Arabic, Buddhist, Hebrew, Japanese
  – Required for correct presentation of dates in some countries
• Olson timezone support, with localizations
• Recent Additions:
  – RFC 822 time zone format support in DateFormat (C+J) for compatibility.
  – “Universal Time” conversions for high-precision date/time computations
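To make concrete why non-Gregorian calendars matter for presentation, the sketch below converts an ISO date into the Thai Buddhist and Japanese calendars. It uses `java.time.chrono` from the JDK (Java 8 and later) as a stand-in; ICU's own calendar classes are not shown, and `CalendarSketch` is a made-up name.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

// Illustrative only: the same day expressed in two other calendar systems.
final class CalendarSketch {
    // Year in the Thai Buddhist calendar for the given ISO date.
    static int buddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    // Year within the current Japanese imperial era for the given ISO date.
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2005, 4, 6);
        System.out.println(buddhistYear(day));      // 2548
        System.out.println(japaneseYearOfEra(day)); // 17 (Heisei era)
    }
}
```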
Formatting
• Date & time: 8 formats per locale
• Messages
  – Completely localizable, Plural support
• Numbers & currencies
  – Scientific Notation, Spelled-out (checks, etc.)
  – Full Orthogonal Currency support
    • INR in Hindi: [example not recoverable]
    • INR in English: Rs. 1,234.57
    • INR in German: Rs. 1.234,57
• Recent Additions
  – POSIX migration library
  – Allows parsing multiple currencies with one formatter
  – Short and stand-alone month/day names
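A localizable message with plural handling can be sketched with the JDK's `java.text.MessageFormat` and its choice sub-format, which ICU's message formatting resembles. The pattern text and the class name `MessageSketch` are invented for this example; in practice the pattern would live in a resource bundle so translators can reorder and reword it.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Illustrative only: one pattern covers the zero, one, and many cases.
final class MessageSketch {
    static final String PATTERN =
            "{0,choice,0#No files|1#One file|1<{0,number,integer} files} in {1}.";

    static String describe(int count, String folder) {
        return new MessageFormat(PATTERN, Locale.ENGLISH).format(new Object[] {count, folder});
    }

    public static void main(String[] args) {
        System.out.println(describe(0, "docs"));
        System.out.println(describe(1, "docs"));
        System.out.println(describe(5, "docs"));
    }
}
```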
Transforms
• Unicode Normalization
  – Highly optimized for performance
  – performance utilities: concatenation, detection, comparison
• Casing (upper, lower, title, folding)
• General Transforms
  – Script transliterations
  – Half-width/Full-width, Hex, etc.
  – Chain transforms together, filter source characters
  – Rule-based, customizable at runtime.
• IDNA: International Domain Names
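Normalization, detection, and comparison can be sketched with the JDK's `java.text.Normalizer` (Java 6 and later). ICU's optimized utilities are not used here, and `NormalizationSketch` is a name made up for the example.

```java
import java.text.Normalizer;

// Illustrative only: composing, decomposing, and comparing equivalent strings.
final class NormalizationSketch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Detection: is the string already in NFC?
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    // Comparison: two strings are canonically equivalent if their NFD forms match.
    static boolean canonicallyEqual(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301"; // 'a' + COMBINING ACUTE ACCENT
        String composed = "\u00E1";    // LATIN SMALL LETTER A WITH ACUTE
        System.out.println(nfc(decomposed).equals(composed));      // true
        System.out.println(canonicallyEqual(decomposed, composed)); // true
    }
}
```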
Segmentation: word, line & sentence
• Fast state-table implementation
• Customizable
  – Rule-based – customizable at runtime
  – Special customizations, e.g. Thai
• Recent Additions:
  – Greatly improved performance when going backwards (common case when doing line break)
  – Java
    • The rules syntax has been extended. Rules can now return information about the types of characters they encountered.
    • Common compiled (binary) rule format with ICU4C
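Word segmentation can be sketched with the JDK's `java.text.BreakIterator`, which shares its basic API shape with ICU4J's class of the same name. The JDK class does not report what kind of segment a rule matched, so this sketch filters segments by their first character instead; that filter and the name `WordBreakSketch` are assumptions of the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: walks word boundaries and keeps segments that look like words.
final class WordBreakSketch {
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Spaces and punctuation are segments too; keep only letter or digit runs.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```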
Unicode Regular Expressions
• Full Regex Implementation
  – ICU4C only: Java 1.4 has its own package (though not as powerful)
• All Unicode 4.0.1 Properties
  – supported through UnicodeSet
• Good performance
  – competitive with non-Unicode regex
• Recent Additions
  – Now features a C API, instead of just C++.
Complex-text layout engine
• Glyph processing, positioning & adjustment
  – ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
• Support for:
  – Drawing
  – Caret Display
  – Hit Testing
  – Selection Highlighting
  – Caret Movement
  – Layout Metrics
  – Line Break
• Recent addition: Canonical Equivalence: a + ´ or á
References
• ICU main site:
  – http://ibm.com/software/globalization/icu
    • New URL
  – Links to
    • Download ICU
    • User Guide, Technical FAQ, Support, Bug Reports
• Unicode Consortium
  – http://www.unicode.org
    • Unicode glossary, Unicode character database
Questions and Answers