Must Know about Unicode
Vinson Hsieh

If you don't know what encoding the string you've been handed uses, you really shouldn't be writing code until you do.

ASCII, ANSI, Unicode

How the world evolved
• When Unix was being invented and K&R (Brian Kernighan and Dennis Ritchie) were writing The C Programming Language, everything was very simple.
• The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII, which represented every character using a number between 32 and 127. This could conveniently be stored in 7 bits.
• Codes below 32 were called unprintable. They were used for control characters, like 7, which made your computer beep, and 12, which caused the current page of paper to go flying out of the printer and a new one to be fed in.

ASCII
The lower 128 codes (0-127) are the most often used. Early email systems in fact would only allow you to transmit characters 0-127 (i.e. "7-bit text").

Plain text = ASCII = characters are 8 bits
• Most computers in those days used 8-bit bytes, so not only could you store every possible ASCII character, you had a whole bit to spare.
• "Gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.
• The IBM PC had something that came to be known as the OEM character set, which provided some accented characters for European languages and a bunch of line-drawing characters (used to draw tables in the DOS era).

[Figure: IBM PC Code Page 850]

Buying PCs outside of America
• For example, on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג). In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

ANSI standard
• Eventually this OEM free-for-all got codified in the ANSI standard.
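• Everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 on up, depending on where you lived.
• These different systems (one per country or organization) were called code pages.

Codes 128 to 255 give only 128 slots. How could that ever be enough for Chinese characters? Big5?

DBCS
• Asian alphabets have thousands of letters, which cannot fit into 8 bits.
• This was usually solved by the messy system called DBCS, the "double byte character set".
• In Visual C++, MBCS always means DBCS.
• Two bytes give 65,536 values, enough to represent more than sixty thousand characters.

How many Chinese characters are there? Counts recorded by dictionaries over time:
• Qin dynasty: the three primers Cangjie (《倉頡》), Boxue (《博學》), and Yuanli (《爰歷》) together held 3,300 characters.
• Han dynasty: Yang Xiong's Xunzuan pian (《訓纂篇》) had 5,340; Xu Shen's Shuowen Jiezi (《説文解字》) had 9,353.
• After the Jin and Song periods the number kept growing. According to the chapter on script in Feng Yan's Tang-dynasty Wenjian ji (《聞見記》), Lü Chen's Zilin (《字林》, Jin) had 12,824; Yang Chengqing's Zitong (《字統》, Later Wei) had 13,734; Gu Yewang's Yupian (《玉篇》, Liang) had 16,917.
• Tang dynasty: Sun Qiang's enlarged edition of Yupian had 22,561.
• Song dynasty: Sima Guang's Leipian (《類篇》) reached 31,319.
• Qing dynasty: the Kangxi Zidian (《康熙字典》) had over 47,000.
• 1915: Zhonghua Da Zidian (《中華大字典》), Ouyang Bocun et al., over 48,000.
• 1959: Dai Kan-Wa Jiten (《大漢和辭典》), Morohashi Tetsuji, Japan, 49,964.
• 1971: Zhongwen Da Cidian (《中文大辭典》), chief editor Zhang Qiyun, 49,888.
• 1990: Hanyu Da Zidian (《漢語大字典》), chief editor Xu Zhongshu, 54,678.
• 1994: Zhonghua Zihai (《中華字海》), Leng Yulong et al., a startling 85,000.
Fortunately, the vast majority of the characters collected in dictionaries like Zhonghua Zihai are "dead characters": they existed at some point in history but are no longer used in today's written language.

Shift-JIS Kanji Table
Multibyte character sets take advantage of the fact that only the first 128 characters of the ASCII set are commonly used (codes 0-127 in decimal, or 0x00-0x7f in hex). When parsing Shift-JIS, if you get a byte in the range 0x80-0xff, you know it is the first byte of a two-byte sequence. Otherwise, it is a single byte of regular ASCII. (Strictly speaking, the Shift-JIS lead bytes are 0x81-0x9f and 0xe0-0xfc; bytes 0xa1-0xdf are single-byte half-width katakana.)
http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml

Character-based applications use whichever code page is set as the active "OEM" (aka "MS-DOS") code page, and Win32 applications use whichever code page is set as the active "ANSI" code page. (Note that Windows "ANSI" code pages do not necessarily map to official ANSI standard character sets.)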
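Two claims from the slides above are easy to check in Python: that one byte value means different things under different code pages, and that a double-byte code page can be walked with the lead-byte rule. The following is a minimal sketch, not part of the original deck. The helper names `describe_byte`, `is_sjis_lead_byte`, and `split_sjis` are hypothetical, the code page names are Python's built-in codec aliases, and the lead-byte ranges assume the common Shift-JIS layout (0x81-0x9F and 0xE0-0xFC).

```python
import unicodedata


# Part 1: the same byte under different OEM code pages.
def describe_byte(value, code_pages):
    """Map each code page to the Unicode name of the character it assigns to `value`."""
    raw = bytes([value])
    return {cp: unicodedata.name(raw.decode(cp)) for cp in code_pages}


# Part 2: walking a Shift-JIS byte string with the lead-byte rule.
def is_sjis_lead_byte(byte):
    """True if `byte` starts a two-byte Shift-JIS sequence (assumed ranges)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def split_sjis(raw):
    """Split Shift-JIS bytes into one-byte and two-byte character units."""
    units = []
    i = 0
    while i < len(raw):
        width = 2 if is_sjis_lead_byte(raw[i]) else 1
        units.append(raw[i:i + width])
        i += width
    return units


if __name__ == "__main__":
    # Byte 130 (0x82) on a US, an Israeli, and a Russian DOS machine.
    for cp, name in describe_byte(130, ["cp437", "cp862", "cp866"]).items():
        print(cp, name)

    # U+65E5 U+672C U+8A9E ("Japanese") followed by plain ASCII.
    sample = "\u65e5\u672c\u8a9eABC".encode("shift_jis")
    for unit in split_sjis(sample):
        print(unit.hex(), len(unit), "byte(s)")
```

Note that the second byte of a two-byte unit may fall in the ASCII range (0x7B in this sample), which is why code that treats DBCS text as a sequence of independent ASCII bytes is unsafe.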
[Figure: cp437 character chart]
http://www.sqlsnippets.com/en/topic-13410.html

ASCII = OEM character sets = MS-DOS
ANSI = MBCS = DBCS = Windows

Can't type Chinese now

Python Win32 Console (DBCS)
Big5 Code Table

Row     0   1   2   3   4   5   6   7   8   9   a
A7D0    役  忘  忌  志  忍  忱  快  忸  忪  戒  我
B750    感  想  愛
A740    作  你

• 我愛你 ("I love you") should be \xa7\xda\xb7\x52\xa7\x41.
• In ASCII, 52 = R and 41 = A, so Python displays it as \xa7\xda\xb7R\xa7A (any byte up to 7F is shown as the corresponding ASCII character, 0-127).
• Each \x escape takes the two hex digits that follow it and combines them into one byte.

The "許功蓋" problem (DBCS)
Some Big5 characters have a second byte that equals an ASCII special character.
• Most common characters affected: 功 餐 許 蓋 閱
• Next most common: 擺 珮 豹 枯 淚 穀 愧
http://www.khngai.com/chinese/charmap/tblbig.php

Second byte 5C == ASCII "\"
A45C 么  A55C 功  A65C 吒  A75C 吭  A85C 沔  A95C 坼  AA5C 歿  AB5C 俞  AC5C 枯  AD5C 苒
AE5C 娉  AF5C 珮  B05C 豹  B15C 崤  B25C 淚  B35C 許  B45C 廄  B55C 琵  B65C 跚  B75C 愧
B85C 稞  B95C 鈾  BA5C 暝  BB5C 蓋  BC5C 墦  BD5C 穀  BE5C 閱  BF5C 璞  C05C 餐  C15C 縷
C25C 擺  C35C 黠  C45C 孀  C55C 髏  C65C 躡

Second byte 7C == ASCII "|"
A47C 弋  A57C 四  A67C 帆  A77C 坑  A87C 育  A97C 尚  AA7C 泜  AB7C 咽  AC7C 洱  AD7C 迢
AE7C 徑  AF7C 砝  B07C 院  B17C 悴  B27C 琍  B37C 逖  B47C 揉  B57C 稅  B67C 閏  B77C 會
B87C 腮  B97C 頌  BA7C 漏  BB7C 誡  BC7C 慝  BD7C 罵  BE7C 魯  BF7C 糕  C07C 嚐  C17C 舉
C27C 甕  C37C 牘  C47C 辮  C57C 疊  C67C 鸛

Python displays the byte '\' as '\\', which is handy: it can be turned back into 5C without ambiguity.

Shift-JIS Kanji Table: 5C/7C
http://www.chi2ko.com/jingyan/shiftjis2uni.htm

What about moving strings to another PC?
• Of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down.
• The Windows 95/98 era

Windows 98
• It has a 16-bit Windows heritage.
• Almost everything uses ANSI strings.

Unicode
• Unicode is only a standard for characters and their numeric codes. It does not define how they are actually stored or accessed on a computer. The Unicode Consortium therefore defined a family of transformation formats for storing Unicode, designed with compatibility with other encodings in mind, called UTF (Unicode/UCS Transformation Format): UTF-8, UTF-16, and UTF-32.

Unicode Code Point Chart
• U+0000 to U+007F: Basic Latin
• U+0080 to U+00FF: Latin-1 Supplement
• U+0100 to U+017F: Latin Extended-A
• U+0180 to U+024F: Latin Extended-B
• U+0250 to U+02AF: IPA Extensions
• U+02B0 to U+02FF: Spacing Modifier Letters
• U+0300 to U+036F: Combining Diacritical Marks
• U+0370 to U+03FF: Greek and Coptic
• U+0400 to U+04FF: Cyrillic
• U+0500 to U+052F: Cyrillic Supplement
• U+0530 to U+058F: Armenian
• U+0590 to U+05FF: Hebrew
• U+0600 to U+06FF: Arabic
• U+0700 to U+074F: Syriac
• U+0750 to U+077F: Arabic Supplement
• U+0780 to U+07BF: Thaana
• U+0900 to U+097F: Devanagari
• …
http://inamidst.com/stuff/unidata/

Unicode terminology
Sample Unicode symbols:
03A0  Π  Greek Capital Letter Pi
03A3  Σ  Greek Capital Letter Sigma
03A9  Ω  Greek Capital Letter Omega
Notation: U+NNNN
uni = {U+03A0} + {U+03A3} + {U+03A9}   (ΠΣΩ)

Now, even though we know exactly what 'uni' represents (ΠΣΩ), note that there is no way to:
1. Print uni to the screen.
2. Save uni to a file.
3. Add uni to another piece of text.
4. Tell me how many bytes it takes to store uni.

Valid encodings of Ω

Encoding name            Binary representation
ISO-8859-7 (OEM/ASCII)   \xD9     ("native" Greek encoding)
UTF-8                    \xCE\xA9
UTF-16                   \xFF\xFE\xA9\x03
UTF-32                   \xFF\xFE\x00\x00\xA9\x03\x00\x00

The UTF-16 and UTF-32 rows are little-endian and begin with a byte order mark (\xFF\xFE).
You should think of Unicode as symbols (Ω), not as bytes.
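The table above can be reproduced with Python's standard codecs. This is a small sketch rather than anything from the original slides: the helper name `encode_omega` is hypothetical, and the byte order mark is prepended explicitly (via `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF32_LE`) so the result matches the little-endian rows regardless of the machine's native byte order.

```python
import codecs

OMEGA = "\u03a9"  # GREEK CAPITAL LETTER OMEGA


def encode_omega():
    """Return the bytes for OMEGA under each encoding listed in the table."""
    return {
        "iso-8859-7": OMEGA.encode("iso-8859-7"),
        "utf-8": OMEGA.encode("utf-8"),
        # Explicit little-endian encoding plus BOM, independent of platform.
        "utf-16": codecs.BOM_UTF16_LE + OMEGA.encode("utf-16-le"),
        "utf-32": codecs.BOM_UTF32_LE + OMEGA.encode("utf-32-le"),
    }


if __name__ == "__main__":
    for name, raw in encode_omega().items():
        # The same symbol needs a different number of bytes in each encoding.
        print(f"{name:<11} {len(raw)} byte(s)  {raw!r}")

    # An encoding that has no slot for the symbol refuses to encode it.
    try:
        OMEGA.encode("ascii")
    except UnicodeEncodeError as exc:
        print("ascii cannot encode it:", exc.reason)
```

One symbol, four different byte strings: the question "how many bytes does it take?" only has an answer once an encoding has been chosen.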
Converting Unicode symbols to Python literals
Pseudocode:
    uni = 'abc_' + {U+03A0} + {U+03A3} + {U+03A9} + '.txt'
Here is how you make that string in Python:
    uni = u"abc_\u03a0\u03a3\u03a9.txt"
Pseudocode:
    uni = {U+1A} + {U+BC3} + {U+1451} + {U+1D10C}
Python (\u takes exactly four hex digits, \U takes exactly eight; the hex digits may be lower or upper case):
    uni = u'\u001a\u0bc3\u1451\U0001d10c'
    uni = u'\u001A\u0BC3\u1451\U0001D10C'

Codecs
• Unicode objects have no fixed computer representation.
• Before a Unicode object can be printed, stored to disk, or sent across a network, it must be encoded into a fixed computer representation. This is done using a codec. Some popular codecs you may have come across in day-to-day work: ASCII, ISO-8859-7, UTF-8, UTF-16.

The right mental model for conversion
• Converting between ANSI and Unicode
• Big5 → Unicode → UTF-8/16/32
• UTF-8/16/32 → Unicode → Big5

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G18421

Unicode plane mapping
http://zh.wikipedia.org/wiki/%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A8%AE%E5%B9%B3%E9%9D%A2#.E5.9F.BA.E6.9C.AC.E5.A4.9A.E6.96.87.E7.A7.8D.E5.B9.B3.E9.9D.A2

UTF-32 (always 4 bytes)
• Each Unicode code point is represented directly by a single 32-bit code unit.
• UTF-32 is restricted to representation of code points in the range 0..10FFFF (hexadecimal), that is, the Unicode codespace.
• UTF-32 may be a preferred encoding form where memory or disk storage space for characters is no particular concern, but where fixed-width, single code unit access to characters is desired.
• UTF-32 is also a preferred encoding form for processing characters on most Unix platforms.

UTF-16 (2 or 4 bytes)
• Code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit.
• Code points in the supplementary planes, in the range U+10000..U+10FFFF, are instead represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs.

UTF-8 (1 to 4 bytes)
The UTF-8 encoding form maintains transparency for all of the ASCII code points (0x00..0x7F).
That means Unicode code points U+0000..U+007F are converted to single bytes 0x00..0x7F in UTF-8. Code points U+0080..U+07FF are represented by two bytes, all non-surrogate code points between U+0800 and U+FFFF are represented by three bytes, and supplementary code points above U+FFFF require four bytes.
Unihan (Han unification) merges Chinese, Japanese, and Korean ideographs and places them in the ranges U+3400..U+9FFF and U+F900..U+FAFF.

Windows 2000 and Unicode
• All of the core functions (creating windows, displaying text, string manipulation) require Unicode strings.
• If you don't use Unicode from the start, your application uses more memory and runs more slowly, because every ANSI string has to be converted to Unicode on the way into the system.

Windows CE and Unicode
• The machines were going to be sold all over the world, so Windows CE is natively Unicode.
• These are machines with little memory and no disk storage, so the ANSI Windows APIs are not supported.

Operating System    Description
Windows 2000        Unicode & ANSI
Windows 98          ANSI only
Windows CE          Unicode only

Since Windows XP, developers are advised to build all their applications on the Unicode versions of the APIs. But you may say, "If I do that, my application will not run under Windows 95, 98 and ME, because those Windows versions do not support the Unicode APIs". This is where the Microsoft Layer for Unicode ("mslu") comes in. The mslu is contained in a DLL called "unicows.dll". It is redistributable, so the intention is that you ship it with your executable and place it in the same folder.

How to convert between Unicode and ANSI in C++
MultiByteToWideChar
http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx
Unicode String and ANSI String
ANSI Version
Converting Big5 to Unicode
Convert ANSI to Unicode

Glyph Rendering
• Automatic context analysis: There is only one key for Arabic "b". The system automatically selects whether the isolated, initial, medial or final form of "b" is appropriate, and changes this if you, for example, add another character afterwards. Notice that only the letter value "b" is stored on disk, not the form: the form is selected dynamically on display.
http://www.smi.uib.no/ksv/ArabicMac.html#uni

Writing Direction (bidirectional)
In Hebrew and Arabic, characters are arranged from right to left into lines, although digits run the other way, making the scripts inherently bidirectional. Left-to-right and right-to-left scripts are frequently used together. In such a case, arranging characters into lines becomes more complex. The Unicode Standard defines an algorithm to determine the layout of a line. See Unicode Standard Annex #9, "The Bidirectional Algorithm", for more information.
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G18421
http://en.wikipedia.org/wiki/Help:Arabic

Sequence of Base Characters and Diacritics
The sequence of Unicode characters U+0061 "a" + U+0308 (combining diaeresis) + U+0075 "u" unambiguously encodes "äu", not "aü".

Unicode Bidirectional Algorithm
http://unicode.org/reports/tr9/

我 - u6211
愛 - u611b
你 - u4f60
http://blog.163.com/guoo1230@126/blog/static/321155112011328102542586/

Why?
U+0000 to U+007F: Basic Latin
U+0370 to U+03FF: Greek and Coptic
U+1400 to U+167F: Unified Canadian Aboriginal Syllabics
U+4E00 to U+9FFF: CJK Unified Ideographs

• The UTF encodings have one advantage: although the number of bytes per character varies, you do not have to scan from the beginning of the text to locate a character correctly, as you do with GB2312/GBK. In a UTF encoding, a fairly simple fixed rule tells you, from the current position alone, whether the current byte starts a code point or continues one, so locating characters is relatively easy. Locating is simplest of all in UTF-32, which needs no scanning whatsoever, but the text grows considerably larger in exchange.
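The "fixed rule" mentioned in the last bullet can be sketched for UTF-8 as follows. Continuation bytes always have the bit pattern 10xxxxxx, so from any byte offset it is enough to step back until a byte that does not match that pattern. This is an illustrative sketch that assumes well-formed UTF-8 input; the function names `is_continuation`, `code_point_start`, `sequence_length`, and `char_at` are hypothetical and not from the slides.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes, which look like 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def code_point_start(data, pos):
    """Step back from `pos` to the first byte of the code point covering it."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


def sequence_length(lead):
    """Number of bytes in the sequence introduced by the lead byte `lead`."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if (lead & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (lead & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (lead & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)


def char_at(data, pos):
    """Decode the single character whose encoded bytes cover offset `pos`."""
    start = code_point_start(data, pos)
    end = start + sequence_length(data[start])
    return data[start:end].decode("utf-8")


if __name__ == "__main__":
    # U+6211 U+611B U+4F60, the three characters from the slide above.
    data = "\u6211\u611b\u4f60".encode("utf-8")
    for pos in range(len(data)):
        start = code_point_start(data, pos)
        print(pos, "->", start, "U+%04X" % ord(char_at(data, pos)))
```

The same lookup is not possible in a DBCS encoding such as Big5 or GBK, where a byte in the middle of the text may be either a lead byte or a trail byte, and only a scan from a known boundary can tell which.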