What's New in Unicode 4.0
Additions to the standard include
minority scripts and 1,226 characters
MARK DAVIS
The introduction of Unicode has had a dramatic effect on the ability of companies to
produce localized software and Web sites, which in turn has enabled users to get access
to localized products much more quickly than they could in the past.
The Unicode Standard has been adopted by such industry leaders as Adobe, Apple, HP,
IBM, JustSystem, Microsoft, Oracle, PeopleSoft, SAP, Sun, Sybase and many others.
Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript),
LDAP, CORBA 3.0, WML and so on, and it is the official way to implement ISO/IEC
10646. It is supported in all modern software and standards.
Unicode 4.0 is the newest major version of the Unicode Standard, including a significant
update of its widely used Unicode Character Database. Version 4.0 defines more than
96,000 characters for the languages of the world and provides detailed properties and
algorithms for computer systems. The current release contains all the information needed
to update software to support the latest characters.
As a significant step towards the digital preservation of world heritage, this new version
encodes characters for Linear B and other ancient Mediterranean alphabets. At the same
time, it expands support for modern minority languages. This action removes a major
barrier that has prevented people from using their own languages on computers.
The text of the standard and the Unicode Standard Annexes has undergone substantial
revision. In particular, the Unicode Character Encoding Model is incorporated, resulting
in fully specified definitions and conformance requirements for UTF-8, UTF-16 and
UTF-32. These are also clearly contrasted with the in-process use of Unicode strings.
Other changes cover program identifiers, the bidirectional algorithm, line breaking and
other boundaries, case conversion and detection, and scripts.
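To make the difference between the three encoding forms concrete, here is a minimal sketch in Python (the choice of language and the helper name code_units are assumptions of this example, not part of the standard). It encodes one supplementary character, U+10000 LINEAR B SYLLABLE B008 A, and lists the code units each form produces.

```python
def code_units(data, width):
    """Split encoded bytes into upper-case hex code units of `width` bytes."""
    return [data[i:i + width].hex().upper() for i in range(0, len(data), width)]

# U+10000 is a supplementary character (Linear B), so it needs more than
# one code unit in both UTF-8 and UTF-16.
ch = "\U00010000"

# The "-be" codec variants emit big-endian code units without a byte order mark.
utf8_units = code_units(ch.encode("utf-8"), 1)
utf16_units = code_units(ch.encode("utf-16-be"), 2)
utf32_units = code_units(ch.encode("utf-32-be"), 4)

print("UTF-8 :", utf8_units)
print("UTF-16:", utf16_units)
print("UTF-32:", utf32_units)
```

The same code point takes four 8-bit units in UTF-8, two 16-bit units in UTF-16 and a single 32-bit unit in UTF-32. Inside the program, Python treats the string as one code point, so len(ch) is 1 regardless of how it is later serialized.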
The forthcoming book, The Unicode Standard, Version 4.0, together with the on-line
Unicode Standard Annexes and the Unicode Character Database, defines Version 4.0 of
the Unicode Standard. The book gives the general principles, requirements for
conformance and guidelines for implementers, followed by character code charts and
names. Some preliminary chapters of the book, as well as the final character code charts,
are available at www.unicode.org
Major additions to Version 4.0 since Version 3.0 include substantial changes to the
introductory and conformance chapters; extensive revisions to the discussion of
punctuation, symbols and format characters; extensive additions of Chinese, Japanese and
Korean characters to cover dictionaries and historic use; many new symbols for
mathematical and technical publication; and many individual characters, such as currency
symbols, for other scripts, including Indic, Khmer, Latin, Greek, Arabic and Syriac. The
specification of conformance requirements is substantially improved and now incorporates
the character encoding model. The new version also includes encoding of supplementary
characters (those with code points greater than U+FFFF); formalized policies for stability
of the standard; clarification of the semantics of special characters, including the Byte
Order Mark; and a major expansion of Unicode Character Database properties and of
specifications for text boundaries and casing. Version 4.0 adds more minority scripts,
including Limbu, Tai Le, Osmanya and Philippine scripts, and more historic scripts,
including Linear B, Cypriot and Ugaritic. It also makes substantial improvements to the
script descriptions, particularly for Indic scripts and Khmer.
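As an illustration of two items in this list, the sketch below shows in Python how a supplementary code point maps to a UTF-16 surrogate pair and how a leading Byte Order Mark can hint at an encoding. The function names to_surrogate_pair and sniff_bom are hypothetical, and the BOM check is only a heuristic: UTF-16 little-endian text that begins with a BOM followed by U+0000 would be misread as UTF-32.

```python
import codecs

def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

def sniff_bom(data):
    """Guess an encoding from a leading byte order mark, or return None."""
    # UTF-32 is tested first because its little-endian BOM begins with
    # the same two bytes as the UTF-16 little-endian BOM.
    candidates = (
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-8", codecs.BOM_UTF8),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    )
    for name, bom in candidates:
        if data.startswith(bom):
            return name
    return None

ugaritic_alpa = 0x10380  # UGARITIC LETTER ALPA, one of the historic scripts
pair = to_surrogate_pair(ugaritic_alpa)
print([hex(unit) for unit in pair])
print(sniff_bom(codecs.BOM_UTF16_BE + "A".encode("utf-16-be")))
```

The arithmetic can be cross-checked against the built-in codec: chr(0x10380).encode("utf-16-be") yields the same two code units.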
Allocation of Code Points in Unicode 4.0
(The character repertoire corresponds to ISO/IEC 10646:2003.)

Type             Code points
Graphic               96,248
Format                   134
Control                   65
Private Use          137,468
Surrogate              2,048
Noncharacter              66
Reserved             878,083
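The allocation figures above can be sanity-checked with a few lines of arithmetic. The Python sketch below sums them and compares the total with the size of the Unicode code space, U+0000 through U+10FFFF. The range boundaries used to re-derive the surrogate, noncharacter and private-use counts come from the general layout of the code space, not from this article, so treat them as assumptions of the example.

```python
# Counts as listed for Unicode 4.0.
allocation = {
    "Graphic": 96_248,
    "Format": 134,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 878_083,
}

total = sum(allocation.values())
code_space = 0x10FFFF + 1  # 17 planes of 65,536 code points each

# Surrogates occupy U+D800..U+DFFF.
surrogates = 0xDFFF - 0xD800 + 1
# Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17
# Private use: U+E000..U+F8FF plus planes 15 and 16, each minus the two
# noncharacters at its end.
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0x10000 - 2)

print(total, code_space, total == code_space)
print(surrogates, noncharacters, private_use)
```

The seven categories account for every code point exactly once, which is why the reserved count shrinks by the same amount whenever new characters are assigned.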
Altogether, Version 4.0 makes 1,226 new character assignments over and above
those in Unicode 3.2. These additions include currency symbols, additional Latin
and Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols, Khmer
symbols, Linear B syllables and ideograms, Cypriot, Ugaritic and a new block of
variation selectors (especially for future CJK variants). Double diacritic characters were
added for dictionary use.
These new characters extend the set of modern currency symbols and represent a greater
coverage of minority and historical scripts.
The Unicode Standard includes more than 70 different character properties, which are
used to determine and ensure that the behavior of characters is correct in different
programs and environments. The documentation for these properties has been expanded
and is available in a new UCD.html file at www.unicode.org/Public/4.0-Update/UCD4.0.0.html
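For a feel of what character properties look like in practice, the following Python sketch queries a handful of them through the standard library's unicodedata module. Note the assumptions: the module exposes only a small subset of the properties, and it reflects whichever Unicode Character Database version the Python build ships with, which will be newer than 4.0. The helper describe is a hypothetical name.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),           # General_Category
        "bidi_class": unicodedata.bidirectional(ch),    # Bidi_Class
        "combining_class": unicodedata.combining(ch),   # Canonical_Combining_Class
    }

# A Latin letter, a currency symbol, a Hebrew letter and a combining accent.
samples = ["A", "\u20AC", "\u05D0", "\u0301"]
rows = [describe(ch) for ch in samples]

for row in rows:
    print(row)
print("UCD version in this Python build:", unicodedata.unidata_version)
```

Programs use such values to drive behavior: the general category separates letters from symbols, the bidirectional class feeds text layout, and the combining class governs how accents are ordered during normalization.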
Unicode 4.0 Annexes
The Unicode Standard contains a number of Unicode Standard Annexes. These Annexes
are available on-line as separate documents.
UAX #9: The Bidirectional Algorithm specifies the visual positioning of characters
flowing from right to left, such as Arabic or Hebrew.
UAX #11: East Asian Width describes an informative width property for Unicode
characters that is useful when interoperating with East Asian Legacy character sets.
UAX #14: Line Breaking Properties provides default definitions for line-break
boundaries, used when word-wrapping text.
UAX #15: Unicode Normalization Forms specifies four normalized forms of Unicode
text. With these forms, text that is equivalent on some level will have identical binary
representations.
UAX #24: Script Names describes a default assignment of script names (such as Latin,
Greek, Arabic, Cyrillic) to all Unicode characters. This information is useful in
mechanisms such as regular expressions.
UAX #29: Text Boundaries provides default definitions for grapheme cluster ("user
character"), word and sentence boundaries.
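Two of these annexes are easy to try out. The Python sketch below uses the standard library's unicodedata module to show the four normalization forms of UAX #15 and the width property of UAX #11. It is an illustration under the assumption that the Python build's own Unicode data is acceptable; that data is newer than Version 4.0, though the sample characters behave the same way.

```python
import unicodedata

composed = "\u00E9"     # e with acute as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT
ligature = "\uFB01"     # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical forms: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
# Compatibility forms additionally replace compatibility characters.
nfkc = unicodedata.normalize("NFKC", ligature)
nfkd = unicodedata.normalize("NFKD", ligature)

print(composed == decomposed)         # raw strings differ
print(nfc == composed, nfd == decomposed)
print(nfkc, nfkd)

# East Asian Width: narrow, fullwidth and wide examples.
widths = {ch: unicodedata.east_asian_width(ch) for ch in ("A", "\uFF21", "\u4E2D")}
print(widths)
```

Normalizing both sides to the same form before comparing is what gives equivalent text an identical binary representation. The canonical forms leave the ligature untouched, while the compatibility forms map it to the two letters "fi".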
About the Unicode Consortium
The Unicode Consortium is a nonprofit organization founded to develop, extend and
promote use of the Unicode Standard, which specifies the representation of text in
modern software products and standards. The membership of the consortium represents a
broad spectrum of corporations and organizations in the computer and information
processing industry. The consortium is supported financially solely through membership
dues. Membership in the Unicode Consortium is open to organizations and individuals
anywhere in the world who support the Unicode Standard and wish to assist in its
extension and implementation.
Mark Davis is chief globalization architect at IBM. He co-founded the Unicode project
and has been the President of the Unicode Consortium since its incorporation. He can be
reached at mark.davis@jtcsv.com