From UCS-2 to UTF-16

Discussion and practical example for
the transition of a Unicode library
from UCS-2 to UTF-16
Why is this an issue?
• The concept of the Unicode standard
changed during its first few years
• Unicode 2.0 (1996) expanded the code point
range from 64k to 1.1M
• APIs and libraries need to follow this
change and support the full range
• Upcoming character assignments (Unicode
3.1, 2001) fall into the added range
“Unicode is a 16-bit character set”
• Concept: 16-bit, fixed-width character set
• Saving space by not including
precomposed, rarely-used, obsolete, …
characters
• Compatibility, transition strategies, and
acceptance forced loosening of these
principles
• Unicode 3.1: >90k assigned characters
16-bit APIs
• APIs developed for Unicode 1.1 used 16-bit
characters and strings: UCS-2
• Assuming 1:1 character:code unit
• Examples: Win32, Java, COM, ICU,
Qt/KDE
• Byte-based UTF-8 (1993) mostly for MBCS
compatibility and transfer protocols
Extending the range
• Set aside two blocks of 1k 16-bit values,
“surrogates”, for extension
• 1k x 1k = 1M = 100000₁₆ additional code
points using a pair of code units
• 16-bit form now variable-width UTF-16
• “Unicode scalar values” 0..10ffff₁₆
• Proposed: 1994; part of Unicode 2.0 (1996)
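The pair arithmetic behind "1k x 1k" can be sketched in C as follows. This is a minimal sketch, not code from any library: the helper names are hypothetical, and it assumes the standard surrogate blocks d800..dbff (first unit of a pair) and dc00..dfff (second unit), which the slide does not list.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */
static int is_lead(uint16_t u)  { return (u & 0xfc00) == 0xd800; }
static int is_trail(uint16_t u) { return (u & 0xfc00) == 0xdc00; }

/* Split a code point in 0x10000..0x10ffff into two 16-bit code units. */
static void to_surrogates(uint32_t c, uint16_t *lead, uint16_t *trail) {
    assert(c >= 0x10000 && c <= 0x10ffff);
    c -= 0x10000;                               /* 20-bit offset into the added range */
    *lead  = (uint16_t)(0xd800 + (c >> 10));    /* upper 10 bits */
    *trail = (uint16_t)(0xdc00 + (c & 0x3ff));  /* lower 10 bits */
}

/* Combine a lead/trail pair back into one code point. */
static uint32_t from_surrogates(uint16_t lead, uint16_t trail) {
    assert(is_lead(lead) && is_trail(trail));
    return 0x10000 + (((uint32_t)(lead - 0xd800) << 10) |
                      (uint32_t)(trail - 0xdc00));
}
```

Each surrogate block holds 1k values, so each unit of a pair carries 10 bits of the 20-bit offset.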
Parallel with ISO-10646
• ISO-10646 uses 31-bit codes: UCS-4
• UCS-2: 16-bit codes for subset 0..ffff₁₆
• UTF-16: transformation of subset 0..10ffff₁₆
• UTF-8 covers all 31 bits
• Private Use areas above 10ffff₁₆ slated for
removal from ISO-10646 for UTF
interoperability and synchronization with
Unicode
21-bit code points
• Code points (“Unicode scalar values”) up to
10ffff₁₆ use 21 bits
• 16-bit code units still good for strings:
variable-width like MBCS
• Default string unit size not big enough for
code points
• Dual types for programming?
C: char/wchar_t dual types
• C/C++ standards: dual types
• Strings mostly with char units (8 bits)
• Code points: wchar_t, 8..32 bits
• Typical use in I18N-ed programs: (8-bit)
char strings but (16/32-bit) wchar_t (or
32-bit int) characters; code point type is
implementation-dependent
Unicode: dual types, too?
• Strings could continue with 16-bit units
• Single code points could get 32-bit data
type
• Dual-type model like C/C++ MBCS
Alternatives to dual 16/32 types
• UTF-32: all types 32 bits wide, fixed-width
• UTF-8: same complexity after range
extension beyond just the BMP, closer to
C/C++ model – byte-based
• Use pairs of 16-bit units
• Use strings for everything
• Make string unit size flexible 8/16/32 bits
UCS-2 to UTF-32
• Fixed-width, single base type for strings and
code points
• UCS-2 programming assumptions mostly
intact
• Wastes at least 33% space, typically 50%
• Performance bottleneck between CPU and memory
UCS-2 to UTF-8
• UCS-2 programming assumes many
characters in single code units
• Breaks a lot of code
• Same question of type for code points;
follow C model, 32-bit wchar_t?
• More difficult transition than other choices
Surrogate pairs for single chars
• Caller avoids code point calculation
• But: caller and callee need to detect and
handle pairs: caller choosing argument
values, callee checking for errors
• Harder to use with code point constants
because they are published as scalar values
• Significant change for caller from using
scalars
Strings for single chars
• Always pass in string (and offset)
• Most general, handles graphemes in
addition to code points
• Harder to use with code point constants
because they are published as scalar values
• Significant change for caller from using
scalars
UTF-flexible
• In principle, if the implementation can
handle variable-width, MBCS-style strings,
could it handle any UTF size as a
compile-time choice?
• Adds interoperability with UTF-8/32 APIs
• Almost no assumptions possible
• Complexity of transition even higher than
for pure UTF-8; performance unclear
Interoperability
• Break existing API users no more than
necessary
• Interoperability with other APIs: Win32,
Java, COM, now also XML DOM
• UTF-16 is Unicode default: good
compromise (speed/ease/space)
• String units should stay 16 bits wide
Does everything need to change?
• String operations: search, substring,
concatenation, … work with any UTF
without change
• Character property lookup and similar: need
to support the extended range
• Formatting: should handle more code points
or even graphemes
• Careful evaluation of all public APIs
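The first bullet can be illustrated with a plain code-unit search. This is a sketch under stated assumptions: u16_find is a hypothetical name, and both strings are assumed to be well-formed UTF-16. Lead surrogates, trail surrogates, and single units occupy disjoint value ranges, so a match for a well-formed pattern cannot begin in the middle of a pair, and the search needs no knowledge of surrogates.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: index of the first occurrence of needle in hay,
   both measured in 16-bit code units, or -1 if there is none. */
static ptrdiff_t u16_find(const uint16_t *hay, size_t hay_len,
                          const uint16_t *needle, size_t needle_len) {
    size_t i;
    if (needle_len > hay_len) {
        return -1;
    }
    for (i = 0; i + needle_len <= hay_len; ++i) {
        /* Compare raw code units; surrogates get no special treatment. */
        if (memcmp(hay + i, needle, needle_len * sizeof(uint16_t)) == 0) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

The same loop is what a UCS-2 implementation would already contain, which is the point of the "without change" bullet.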
ICU: some of all
• Strings: UTF-16, UChar type remains 16-bit
• New UChar32 for code points
• Provide macros for C to deal with all UTFs:
iteration, random access, …
• C++ CharacterIterator: many new functions
• Property lookup/low-level: UChar32
• Formatting: strings for graphemes
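What the iteration and random-access helpers have to do for UTF-16 can be sketched as plain C functions. The names below are hypothetical and are not ICU's macros; the sketch assumes that an unpaired surrogate is simply returned as its own value.

```c
#include <stddef.h>
#include <stdint.h>

#define IS_LEAD(u)  (((u) & 0xfc00) == 0xd800)
#define IS_TRAIL(u) (((u) & 0xfc00) == 0xdc00)

/* Iteration: read the code point at s[*i] and advance *i past it.
   Requires *i < length. An unpaired surrogate is returned unchanged. */
static int32_t next_code_point(const uint16_t *s, size_t *i, size_t length) {
    int32_t c = s[(*i)++];
    if (IS_LEAD(c) && *i < length && IS_TRAIL(s[*i])) {
        c = 0x10000 + ((c - 0xd800) << 10) + (s[(*i)++] - 0xdc00);
    }
    return c;
}

/* Random access: if index i points at the second unit of a pair,
   move it back to the first unit so that it sits on a boundary. */
static size_t code_point_start(const uint16_t *s, size_t i) {
    if (i > 0 && IS_TRAIL(s[i]) && IS_LEAD(s[i - 1])) {
        --i;
    }
    return i;
}
```

Strings stay arrays of 16-bit units; only the returned code point needs a 32-bit type, which matches the UChar/UChar32 split above.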
Scalar code points: property lookup
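• Old, 16-bit (two-stage lookup):
UChar u_tolower(UChar c){
    /* stage 1: bits 15..7; offset in block: bits 6..0 */
    return u[v[c>>7]+(c&0x7f)];
}
• New, 21-bit (three-stage lookup):
UChar32 u_tolower(UChar32 c){
    /* stage 1: bits 20..10; stage 2: bits 9..4; offset: bits 3..0 */
    return u[v[w[c>>10]+((c>>4)&0x3f)]+(c&0xf)];
}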
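A self-contained toy version of the three-stage lookup shows how the stages share blocks. Everything here is an assumption made for illustration: the table sizes, the helper names, the choice to store case deltas rather than result characters, and the data, which covers only ASCII A..Z and the Deseret capitals U+10400..U+10425. It is not ICU's actual table layout.

```c
#include <assert.h>
#include <stdint.h>

/* Block 0 of v and of u is a shared all-zero "no mapping" block. */
static uint16_t w[0x110000 >> 10];  /* stage 1: indexed by bits 20..10 */
static uint16_t v[4 * 64];          /* stage 2: blocks of 64, bits 9..4 */
static int8_t   u[8 * 16];          /* stage 3: blocks of 16 case deltas */
static uint16_t v_top = 64, u_top = 16;  /* next unused block in v and u */

/* Store a delta for c, allocating stage-2/3 blocks only when needed. */
static void set_delta(uint32_t c, int8_t delta) {
    uint16_t *pv;
    if (w[c >> 10] == 0) {
        assert(v_top + 64 <= sizeof v / sizeof v[0]);
        w[c >> 10] = v_top;
        v_top += 64;
    }
    pv = &v[w[c >> 10] + ((c >> 4) & 0x3f)];
    if (*pv == 0) {
        assert(u_top + 16 <= sizeof u / sizeof u[0]);
        *pv = u_top;
        u_top += 16;
    }
    u[*pv + (c & 0xf)] = delta;
}

/* Fill the toy tables; safe to call more than once. */
static void init_tables(void) {
    uint32_t c;
    for (c = 'A'; c <= 'Z'; ++c) set_delta(c, 32);
    for (c = 0x10400; c <= 0x10425; ++c) set_delta(c, 40);
}

/* Same index expression as the 21-bit lookup above, returning c + delta. */
static int32_t toy_tolower(int32_t c) {
    if (c < 0 || c > 0x10ffff) {
        return c;
    }
    return c + u[v[w[c >> 10] + ((c >> 4) & 0x3f)] + (c & 0xf)];
}
```

After init_tables(), only 2 of the 1088 stage-1 entries point at a block of their own; all others fall through to the shared zero blocks, which is what keeps a table over a 21-bit range compact.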
Formatting: grapheme strings
• Old:
void setDecimalSymbol(UChar c);
• New:
void setDecimalSymbol(const UnicodeString &s);
Codepage conversion
• To Unicode: results are one or two UTF-16
code units, surrogates stored directly in the
conversion table
• From Unicode: triple-stage compact array
access from 21-bit code points like property
lookup
• Single-character-conversion to Unicode
now returns UChar32 values
API first…
• Tools and basic functions and classes are in
place (property lookup, conversion,
iterators, BiDi)
• Public APIs reviewed and changed
(“luxury” of early project stage) or
deprecated and superseded by new versions
• Higher-level implementations to follow
before Unicode 3.1 published
More implementations follow…
• Collation: need to prepare for >64k primary
keys
• Normalization and Transliteration
• Word/Sentence break iteration
• Etc.
• No non-BMP data before Unicode 3.1 is
stable
Other libraries
• Java: planning stage for transition
• Win32: rendering and UniScribe API
largely UTF-16-ready
• Linux: standardizing on 32-bit Unicode
wchar_t, has UTF-8 locales like other
Unixes for char* APIs
• W3C: standards assume full UTF-16 range
Summary
• Transition from UCS-2 to UTF-16 gains
importance four years after UTF-16 entered
the standard
• APIs for single characters need change or
new versions
• String APIs: no change
• Implementations need to handle 21-bit code
points
• Range of options
Resources
• Unicode FAQ:
http://www.unicode.org/unicode/faq/
• Unicode on IBM developerWorks:
http://www.ibm.com/developer/unicode/
• ICU: http://oss.software.ibm.com/icu/