What's New in Unicode 4.0

Additions to the standard include minority scripts and 1,226 characters

MARK DAVIS

The introduction of Unicode has had a dramatic effect on the ability of companies to produce localized software and Web sites, which in turn has enabled users to get access to localized products much more quickly than they could in the past. The Unicode Standard has been adopted by such industry leaders as Adobe, Apple, HP, IBM, JustSystem, Microsoft, Oracle, PeopleSoft, SAP, Sun, Sybase and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0 and WML, and it is the official way to implement ISO/IEC 10646. It is supported in all modern software and standards.

Unicode 4.0 is the newest major version of the Unicode Standard, and it includes a significant update of the widely used Unicode Character Database. Version 4.0 defines more than 96,000 characters for the languages of the world and provides detailed properties and algorithms for computer systems. The current release contains all the information needed to update software to support the latest characters.

As a significant step towards the digital preservation of world heritage, this new version encodes characters for Linear B and other ancient Mediterranean scripts. At the same time, it expands support for modern minority languages. This action removes a major barrier that has prevented people from using their own languages on computers.

The text of the standard and of the Unicode Standard Annexes has undergone substantial revision. In particular, the Unicode Character Encoding Model is incorporated, resulting in fully specified definitions and conformance requirements for UTF-8, UTF-16 and UTF-32. These encoding forms are also clearly contrasted with the in-process use of Unicode strings. Other changes affect program identifiers; bidirectional ordering; line-breaking and other boundaries; case conversion and detection; and scripts.
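As an illustrative sketch, and not part of the standard's own text, the following Python snippet shows how a single code point maps to code units in each of the three encoding forms named above. The helper name code_units and the two sample characters are assumptions chosen for this example; the big-endian codecs are used only so that Python does not prepend a byte order mark.

```python
def code_units(text, encoding, unit_bytes):
    """Return the code units of `text` as uppercase hex strings.

    `unit_bytes` is the size of one code unit in the chosen encoding form:
    1 for UTF-8, 2 for UTF-16, 4 for UTF-32.
    """
    data = text.encode(encoding)
    return [
        data[i:i + unit_bytes].hex().upper()
        for i in range(0, len(data), unit_bytes)
    ]


# (label, Python codec name, bytes per code unit)
ENCODING_FORMS = [
    ("UTF-8", "utf-8", 1),
    ("UTF-16", "utf-16-be", 2),
    ("UTF-32", "utf-32-be", 4),
]

# One character below U+FFFF and one supplementary character.
# U+10000 is LINEAR B SYLLABLE B008 A, one of the Version 4.0 additions.
SAMPLES = ["\u00E9", "\U00010000"]

for ch in SAMPLES:
    for label, codec, size in ENCODING_FORMS:
        units = " ".join(code_units(ch, codec, size))
        print(f"U+{ord(ch):04X} {label}: {units}")

# Output:
# U+00E9 UTF-8: C3 A9
# U+00E9 UTF-16: 00E9
# U+00E9 UTF-32: 000000E9
# U+10000 UTF-8: F0 90 80 80
# U+10000 UTF-16: D800 DC00
# U+10000 UTF-32: 00010000
```

The supplementary character occupies four code units in UTF-8, two in UTF-16 (a surrogate pair) and one in UTF-32, which is why the number of code units in a string generally differs from the number of characters it contains.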
The forthcoming book, The Unicode Standard, Version 4.0, together with the on-line Unicode Standard Annexes and the Unicode Character Database, defines Version 4.0 of the Unicode Standard. The book gives the general principles, requirements for conformance and guidelines for implementers, followed by character code charts and names. Some preliminary chapters of the book, as well as the final character code charts, are available at www.unicode.org.

Major additions in Version 4.0 since Version 3.0 include substantial changes to the introductory and conformance chapters; extensive revisions to the discussion of punctuation, symbols and format characters; extensive additions of Chinese, Japanese and Korean characters to cover dictionaries and historic use; many new symbols for mathematical and technical publication; and many individual characters, such as currency symbols, for other scripts, including Indic, Khmer, Latin, Greek, Arabic and Syriac. The specification of conformance requirements is substantially improved and now incorporates the character encoding model.

The new version also includes the encoding of supplementary characters (those with code points greater than FFFF₁₆); formalized policies for the stability of the standard; clarification of the semantics of special characters, including the Byte Order Mark; and a major expansion of Unicode Character Database properties and of the specifications for text boundaries and casing. Version 4.0 adds more minority scripts, including Limbu, Tai Le, Osmanya and Philippine scripts, as well as more historic scripts, including Linear B, Cypriot and Ugaritic. The script descriptions are substantially improved, particularly for Indic scripts and Khmer.

Allocation of Code Points in Unicode 4.0
(The character repertoire corresponds to ISO/IEC 10646:2003.)
Graphic   Format   Control   Private Use   Surrogate   Noncharacter   Reserved
96,248    134      65        137,468       2,048       66             878,083

Altogether, 1,226 new character assignments were made in Version 4.0, over and above those in Unicode 3.2. These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic and a new block of variation selectors (especially for future CJK variants). Double diacritic characters were added for dictionary use. These new characters extend the set of modern currency symbols and provide greater coverage of minority and historical scripts.

The Unicode Standard includes more than 70 different character properties, which are used to ensure that characters behave correctly in different programs and environments. The documentation for these properties has been expanded and is available in a new UCD.html file at www.unicode.org/Public/4.0-Update/UCD4.0.0.html

Unicode 4.0 Annexes

The Unicode Standard contains a number of Unicode Standard Annexes. These Annexes are available on-line as separate documents.

UAX #9: The Bidirectional Algorithm specifies the visual positioning of characters that flow from right to left, such as those of Arabic or Hebrew.
UAX #11: East Asian Width describes an informative width property for Unicode characters that is useful when interoperating with East Asian legacy character sets.
UAX #14: Line Breaking Properties provides default definitions for line-break boundaries, used when word-wrapping text.
UAX #15: Unicode Normalization Forms specifies four normalized forms of Unicode text. With these forms, text that is equivalent on some level will have identical binary representations.
UAX #24: Script Names describes a default assignment of script names (such as Latin, Greek, Arabic, Cyrillic) to all Unicode characters. This information is useful in mechanisms such as regular expressions.
UAX #29: Text Boundaries provides default definitions for grapheme cluster ("user character"), word and sentence boundaries.

About the Unicode Consortium

The Unicode Consortium is a nonprofit organization founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry. The consortium is supported financially solely through membership dues. Membership in the Unicode Consortium is open to organizations and individuals anywhere in the world who support the Unicode Standard and wish to assist in its extension and implementation.

Mark Davis is chief globalization architect at IBM. He co-founded the Unicode project and has been president of the Unicode Consortium since its incorporation. He can be reached at mark.davis@jtcsv.com