Unicode

Must Know about Unicode
Vinson Hsieh
1
If you don't know the encoding of the string
you have been handed, you really shouldn't be writing code
until you do
2
ASCII
ANSI
Unicode
3
How the world evolved
• When Unix was being invented and K&R (Brian
Kernighan and Dennis Ritchie) were writing The C
Programming Language, everything was very simple.
• The only characters that mattered were good old
unaccented English letters, and we had a code for them
called ASCII, which was able to represent every
character using a number between 32 and 127. This
could conveniently be stored in 7 bits.
• Codes below 32 were called unprintable. They were
used for control characters, like 7, which made your
computer beep, and 12, which caused the current page
of paper to go flying out of the printer and a new one
to be fed in.
4
ASCII
The lower 128 (codes 0-127) are the most often used codes. Early email
systems in fact would only allow you to transmit characters 0-127 (i.e. "7-bit
text")
5
Plain text = ASCII = Characters 8 bits
• Most computers in those days were using 8-bit
bytes, so not only could you store every
possible ASCII character, but you had a whole
bit to spare.
• "Gosh, we can use the codes 128-255 for our
own purposes." The trouble was, lots of
people had this idea at the same time, and
they had their own ideas of what should go
where in the space from 128 to 255.
6
• The IBM-PC had something that came to be
known as the OEM character set, which
provided some accented characters for
European languages and a bunch of line
drawing characters (used to draw tables in the DOS era)
IBM PC Code Page 850
7
Buying PCs outside of America
• For example on some PCs the character code
130 would display as é, but on computers sold
in Israel it was the Hebrew letter Gimel (ג). In
many cases, such as Russian, there were lots
of different ideas of what to do with the
upper-128 characters, so you couldn’t even
reliably interchange Russian documents.
8
ANSI standard
• Eventually this OEM free-for-all got codified in
the ANSI standard.
• Everybody agreed on what to do below 128,
which was pretty much the same as ASCII, but
there were lots of different ways to handle the
characters from 128 and on up, depending on
where you lived.
• These different systems (one per country or organization) were
called code pages.
9
128 to 255 is only 128 codes. How could that be enough for Chinese characters?
Big5?
10
DBCS
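• Asian alphabets have thousands of letters, which
could never fit into 8 bits
• This was usually solved by the messy system called
DBCS, the "double byte character set"
• In Visual C++, MBCS always means DBCS
• Two bytes give 65,536 possible values, enough to represent over sixty thousand characters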
11
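The three Qin-dynasty texts 《倉頡》, 《博學》, and 《爰歷》 contained 3,300
characters in total. In the Han dynasty, Yang Xiong's 《訓纂篇》 had 5,340, and by
Xu Shen's 《説文解字》 the count had reached 9,353. After the Jin and Song periods
the number kept growing. According to the Tang-dynasty 《聞見記文字篇》 by Feng Yan,
Lü Chen's 《字林》 (Jin) had 12,824 characters, Yang Chengqing's 《字統》 (Later Wei) had
13,734, and Gu Yewang's 《玉篇》 (Liang) had 16,917. Sun Qiang's expanded Tang-dynasty
edition of 《玉篇》 had 22,561. By the Song dynasty, Sima Guang's 《類篇》 had grown
to 31,319, and the Qing-dynasty 《康熙字典》 had more than 47,000.
The 1915 《中華大字典》 by Ouyang Bocun (歐陽博存) and others had more than 48,000.
The 1959 《大漢和辭典》 by Morohashi Tetsuji of Japan collected 49,964.
The 1971 《中文大辭典》, edited by Zhang Qiyun, has 49,888.
The 1990 《漢語大字典》, edited by Xu Zhongshu, collected 54,678.
The 1994 《中華字海》 by Leng Yulong and others is even more astonishing, with as many as 85,000.
Fortunately, the vast majority of the characters in dictionaries like 《中華字海》 are "dead characters":
characters that existed historically but are no longer used in today's written language.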
12
Shift-JIS Kanji Table
Multibyte character sets take advantage of the
fact that only the first 128 characters of the
ASCII set are commonly used (codes 0-127 in
decimal, or 0x00-0x7f in hex). When parsing
Shift-JIS, if you get a byte in the range 0x80-0xff, you know it is the first byte of a
two-byte sequence. Otherwise, it is a single byte of
regular ASCII. (Strictly speaking, Shift-JIS lead bytes are 0x81-0x9f and 0xe0-0xfc;
bytes 0xa1-0xdf are single-byte half-width katakana.)
http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
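The parsing rule described on this slide can be sketched in a few lines. This is a minimal illustration assuming Python 3 and the simplified rule as stated (any byte of 0x80 or above starts a two-byte sequence); the helper name `split_dbcs` is hypothetical, and a real Shift-JIS parser would also have to treat the single-byte half-width katakana range separately.

```python
def split_dbcs(data):
    """Split bytes into characters using the simplified DBCS rule:
    a byte >= 0x80 is a lead byte and takes the next byte with it."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if data[i] >= 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


sample = "A日本".encode("shift_jis")
pieces = split_dbcs(sample)
decoded = [piece.decode("shift_jis") for piece in pieces]
print(pieces)
print(decoded)
```

Note that the scan has to start at the beginning of the string: a byte taken from the middle cannot be classified on its own, because a trail byte may fall in the ASCII range.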
13
Character-based applications use whichever
code page is set as the active "OEM" (aka "MS-DOS") code page,
and Win32 applications use
whichever code page is set as the active "ANSI"
code page. (Note that Windows "ANSI" code
pages do not necessarily map to official ANSI
standard character sets.)
cp437
14
http://www.sqlsnippets.com/en/topic-13410.html
ASCII = OEM character sets = MS-DOS
ANSI = MBCS = DBCS = Windows
15
Can’t type Chinese now
16
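Python Win32 Console (DBCS)
Big5 Code Table (row label + column digit = code, e.g. A7D0 + a = A7DA)
      0  1  2  3  4  5  6  7  8  9  a
A7D0  役 忘 忌 志 忍 忱 快 忸 忪 戒 我
B750  感 想 愛
A740  作 你
我愛你 should be \xa7\xda\xb7\x52\xa7\x41
In ASCII, 0x52 = R and 0x41 = A,
so it is displayed as \xa7\xda\xb7R\xa7A
(bytes up to 0x7F are shown as the corresponding ASCII characters 0-127)
Note that \x combines the two hex digits after it into a single byte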
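The console behaviour above can be reproduced with Python's built-in `big5` codec. This is a small sketch assuming Python 3, where encoded text is a `bytes` object (the slide appears to show a Python 2 console, where the same bytes were a plain `str`).

```python
text = "我愛你"
encoded = text.encode("big5")

# The bytes repr prints printable ASCII bytes as characters,
# so 0x52 shows up as R and 0x41 as A.
print(encoded)        # b'\xa7\xda\xb7R\xa7A'
print(encoded.hex())  # a7dab752a741
```

The two views show the same six bytes; only the display of the ASCII-range trail bytes differs.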
17
The "許功蓋" problem (DBCS)
Most common affected characters: 功餐許蓋閱
Next most common: 擺珮豹枯淚穀愧
http://www.khngai.com/chinese/charmap/tblbig.php
ASCII(5C) == “\”
A45C么 AE5C娉 B85C稞 C25C擺 A55C功 AF5C珮 B95C鈾 C35C黠 A65C吒 B05C豹
BA5C暝 C45C孀 A75C吭 B15C崤 BB5C蓋 C55C髏 A85C沔 B25C淚 BC5C墦 C65C躡
A95C坼 B35C許 BD5C穀 AA5C歿 B45C廄 BE5C閱 AB5C俞 B55C琵 BF5C璞 AC5C枯
B65C跚 C05C餐 AD5C苒 B75C愧 C15C縷
ASCII(7C) == “|”
AA7C泜 B47C揉 A87C育 BE7C魯 B27C琍 BC7C慝 C67C鸛 A97C尚 B37C逖 BD7C罵
A77C坑 B17C悴 BB7C誡 C57C疊 A67C帆 B07C院 BA7C漏 C47C辮 AB7C咽 B57C稅
BF7C糕 AC7C洱 B67C閏 C07C嚐 AD7C迢 B77C會 C17C舉 A47C弋 AE7C徑 B87C腮
C27C甕 A57C四 AF7C砝 B97C頌 C37C牘
Python turns '\' into '\\' when displaying the string, which is nice: it can be converted back to 0x5C
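A short check of the table above, assuming Python 3 and its built-in `big5` codec: each of the three characters in 許功蓋 has 0x5C, the ASCII backslash, as its second byte, which is why byte-oriented code that treats backslash as an escape character corrupts them.

```python
word = "許功蓋"
encoded = word.encode("big5")

# Every Big5 character here is two bytes; take the second byte of each.
trail_bytes = encoded[1::2]
print(encoded)       # the repr escapes each backslash as \\
print(trail_bytes)

# Characters whose trail byte collides with the ASCII backslash (0x5C).
affected = [ch for ch in word if ch.encode("big5")[1] == 0x5C]
print(affected)
```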
18
Shift-JIS Kanji Table 5C/7C
http://www.chi2ko.com/jingyan/shiftjis2uni.htm
19
What about moving strings to another PC?
• Of course, as soon as the Internet happened,
it became quite commonplace to move strings
from one computer to another, and the whole
mess came tumbling down.
• The Win95/98 era
20
Windows 98
It has 16-bit Windows heritage
– Almost everything uses ANSI strings
21
Unicode
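• Unicode is only a standard for characters and
their code points; it does not define how they
are actually stored on a computer. The Unicode
Consortium therefore defined a family of formats
for storing Unicode code points, designed with
compatibility with other encodings in mind, called
UTF (Unicode/UCS Transformation Format): UTF-8, UTF-16, and UTF-32.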
22
Unicode Code Point Chart
• U+0000 to U+007F: Basic Latin
• U+0080 to U+00FF: Latin-1 Supplement
• U+0100 to U+017F: Latin Extended-A
• U+0180 to U+024F: Latin Extended-B
• U+0250 to U+02AF: IPA Extensions
• U+02B0 to U+02FF: Spacing Modifier Letters
• U+0300 to U+036F: Combining Diacritical Marks
• U+0370 to U+03FF: Greek and Coptic
• U+0400 to U+04FF: Cyrillic
• U+0500 to U+052F: Cyrillic Supplement
• U+0530 to U+058F: Armenian
• U+0590 to U+05FF: Hebrew
• U+0600 to U+06FF: Arabic
• U+0700 to U+074F: Syriac
• U+0750 to U+077F: Arabic Supplement
• U+0780 to U+07BF: Thaana
• U+0900 to U+097F: Devanagari
• …
http://inamidst.com/stuff/unidata/
23
Unicode terminology
Sample Unicode Symbols
Code point  Symbol  Name
03A0        Π       Greek Capital Letter Pi
03A3        Σ       Greek Capital Letter Sigma
03A9        Ω       Greek Capital Letter Omega
Notation: U+NNNN
uni = {U+03A0} + {U+03A3} + {U+03A9} (ΠΣΩ)
24
Now, even though we know exactly what 'uni'
represents (ΠΣΩ), note that there is no way to:
1. Print uni to the screen.
2. Save uni to a file.
3. Add uni to another piece of text.
4. Tell me how many bytes it takes to store uni.
25
Valid Encodings of Ω
Encoding name            Binary representation
ISO-8859-7 (OEM/ASCII)   \xD9 ("native" Greek encoding)
UTF-8                    \xCE\xA9
UTF-16                   \xFF\xFE\xA9\x03 (byte order mark + little-endian)
UTF-32                   \xFF\xFE\x00\x00\xA9\x03\x00\x00 (byte order mark + little-endian)
You should think of Unicode as symbols (Ω), not as bytes.
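The table can be checked directly. The sketch below assumes Python 3; it uses the explicit little-endian codecs and prepends the byte order mark by hand so that the result does not depend on the machine's native byte order.

```python
import codecs

omega = "\u03a9"  # Ω, Greek Capital Letter Omega

greek = omega.encode("iso-8859-7")
utf8 = omega.encode("utf-8")
# BOM + little-endian payload, matching the byte sequences in the table.
utf16 = codecs.BOM_UTF16_LE + omega.encode("utf-16-le")
utf32 = codecs.BOM_UTF32_LE + omega.encode("utf-32-le")

for name, raw in [("ISO-8859-7", greek), ("UTF-8", utf8),
                  ("UTF-16", utf16), ("UTF-32", utf32)]:
    print(name, raw, len(raw), "bytes")
```

One symbol, four different byte sequences of four different lengths: the bytes only mean something once the encoding is known.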
26
Converting Unicode symbols to Python literals
Pseudocode:
uni = 'abc_' + {U+03A0} + {U+03A3} + {U+03A9} + '.txt'
Here is how you make that string in Python:
uni = u"abc_\u03a0\u03a3\u03a9.txt"
Pseudocode:
uni = {U+1A} + {U+BC3} + {U+1451} + {U+1D10C}
Python (\u takes exactly 4 hex digits, \U takes exactly 8; the digits may be lowercase or uppercase):
uni = u'\u001a\u0bc3\u1451\U0001d10c'
uni = u'\u001A\u0BC3\u1451\U0001D10C'
27
Codecs
• Unicode objects have no fixed computer
representation.
• Before a Unicode object can be printed,
stored to disk, or sent across a network, it
must be encoded into a fixed computer
representation. This is done using a codec.
Some popular codecs you may have heard
about in your day-to-day
experiences: ASCII, ISO-8859-7, UTF-8, UTF-16.
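A minimal sketch of what "encoding with a codec" looks like in practice, assuming Python 3 (where `str` is the Unicode object and `bytes` is the encoded form). The string is the `abc_ΠΣΩ.txt` example used earlier in the slides.

```python
uni = "abc_\u03a0\u03a3\u03a9.txt"

# The same text has different sizes under different codecs.
utf8_bytes = uni.encode("utf-8")
greek_bytes = uni.encode("iso-8859-7")
print(len(uni), len(utf8_bytes), len(greek_bytes))  # 11 14 11

# Decoding with the matching codec restores the original text.
restored = utf8_bytes.decode("utf-8")

# A codec that cannot represent a character refuses to encode it.
try:
    uni.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```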
28
The right mental model for conversion
• Converting between ANSI and Unicode
• Big5 → Unicode → UTF-8/16/32
• UTF-8/16/32 → Unicode → Big5
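The two-step conversion can be sketched as follows, assuming Python 3: decode the source bytes into a Unicode string first, then encode that string into the target encoding. There is no direct Big5-to-UTF-8 step; Unicode is always the pivot.

```python
big5_bytes = b"\xa7\xda\xb7\x52\xa7\x41"

# Step 1: Big5 bytes -> Unicode text.
text = big5_bytes.decode("big5")

# Step 2: Unicode text -> whichever UTF is needed.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# And back again: UTF-8 -> Unicode -> Big5.
round_trip = utf8_bytes.decode("utf-8").encode("big5")

print(text, len(big5_bytes), len(utf8_bytes), len(utf16_bytes))
```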
29
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G18421
30
Unicode plane mapping
http://zh.wikipedia.org/wiki/%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A8%AE%E5%B9%B3%E9%9D%A2#.E5.9F.BA.E6.9C.AC.E5.A4.9A.E6.96.87.E7.A7.8D.E5.B9.B3.E9.9D.A2
31
UTF-32 (always 4 bytes)
• UTF-32: each Unicode code point is represented
directly by a single 32-bit code unit
• UTF-32 is restricted to representation of code
points in the range 0..10FFFF (hexadecimal), that is, the
Unicode codespace
• UTF-32 may be a preferred encoding form where
memory or disk storage space for characters is no
particular concern, but where fixed-width, single
code unit access to characters is desired. UTF-32
is also a preferred encoding form for processing
characters on most Unix platforms.
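A quick illustration of the fixed width, assuming Python 3 and the `utf-32-le` codec (little-endian, no byte order mark): every code point takes four bytes, and those four bytes are simply the code point number.

```python
samples = ["A", "\u03a9", "\u6211", "\U0001d10c"]

# Four bytes per character regardless of the code point's size.
widths = [len(ch.encode("utf-32-le")) for ch in samples]
print(widths)  # [4, 4, 4, 4]

# The encoded value is the code point itself.
code_point = int.from_bytes("\u03a9".encode("utf-32-le"), "little")
print(hex(code_point))  # 0x3a9
```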
32
UTF-16 (2 or 4 bytes)
Code points in the range U+0000..U+FFFF are represented
as a single 16-bit code unit; code points in the
supplementary planes, in the range U+10000..U+10FFFF,
are instead represented as pairs of 16-bit code units. These
pairs of special code units are known as surrogate pairs.
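The surrogate-pair arithmetic can be sketched as below, assuming Python 3. The helper name `to_surrogate_pair` is hypothetical; the constants 0xD800 and 0xDC00 are the starts of the high and low surrogate ranges. The example code point U+1D10C is the supplementary-plane character used in the Python literal slide.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into two 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D10C)
print(hex(high), hex(low))  # 0xd834 0xdd0c

# Cross-check against Python's own big-endian UTF-16 codec.
codec_bytes = "\U0001d10c".encode("utf-16-be")
manual_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
```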
33
UTF-8 (1 to 4 bytes)
• The UTF-8 encoding form maintains transparency for all of the ASCII code
points (0x00..0x7F). That means Unicode code points U+0000..U+007F are
converted to single bytes 0x00..0x7F in UTF-8.
• Code points between U+0080 and U+07FF are represented by two bytes.
• All non-surrogate code points between U+0800 and U+FFFF are represented
by three bytes, and supplementary code points above U+FFFF require four
bytes.
Unihan (unified Han characters) consolidates Chinese, Japanese, and Korean ideographs into the ranges U+3400-U+9FFF and U+F900-U+FAFF
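The length tiers can be observed directly. A small sketch assuming Python 3, with one sample character from each tier (an ASCII letter, a Greek letter, a CJK ideograph, and a supplementary-plane symbol):

```python
samples = ["A", "\u03a9", "\u6211", "\U0001d10c"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X" % ord(ch), len(encoded), encoded)

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```

This is also why typical Chinese text takes three bytes per character in UTF-8 but two in Big5 or UTF-16.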
34
Windows 2000 and Unicode
All of the core functions (creating windows, displaying text, string
manipulation) require Unicode strings
If you don't use Unicode from the start, your application uses more memory and runs slower
35
Windows CE and Unicode
The machines were going to be sold all over the world
– Windows CE is natively Unicode
A machine with little memory and no disk storage
– The ANSI Windows APIs are not supported
Operating System  Description
Windows 2000      Unicode & ANSI
Windows 98        ANSI only
Windows CE        Unicode only
Since Windows XP, it is recommended that developers build all their applications using the
Unicode versions of the APIs. But you may say, "if I do that my application will not run
under Windows 95, 98 and ME because those Windows versions do not support the
Unicode APIs". Well, this is where the Microsoft Layer for Unicode (or "MSLU") comes
in. The MSLU is contained in a DLL called "unicows.dll". This is redistributable, so the
intention is that you will ship this with your executable for placement in the same
folder as your executable.
36
How to convert between Unicode and ANSI in C++
37
38
MultiByteToWideChar
http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx
39
40
Unicode String and ANSI String
41
42
43
ANSI Version
44
Converting Big5 encoding to Unicode
45
Convert ANSI to Unicode
46
Glyph Rendering
• Automatic context analysis: There is only one
key for Arabic "b". The system automatically
selects whether the isolate, initial, medial or
final form of "b" is appropriate, and changes
this if you e.g. add another character
afterwards. Notice that only the letter value
"b" is stored on disk, not the form: this is only
selected dynamically on display.
http://www.smi.uib.no/ksv/ArabicMac.html#uni
47
Writing Direction (bidirectional)
letters, punctuation, symbols, and diacritics
In Hebrew and Arabic, characters are arranged from
right to left into lines, although digits run the other
way, making the scripts inherently bidirectional.
Left-to-right and right-to-left scripts are frequently
used together. In such a case, arranging characters
into lines becomes more complex. The Unicode
Standard defines an algorithm to determine the
layout of a line. See Unicode Standard Annex #9,
“The Bidirectional Algorithm,” for more information.
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G18421
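The bidirectional algorithm works from a direction class assigned to every character. The sketch below, assuming Python 3, only looks up those classes with the standard `unicodedata` module; it does not implement the layout algorithm itself.

```python
import unicodedata

samples = {
    "latin a": "a",
    "hebrew alef": "\u05d0",
    "arabic alef": "\u0627",
    "digit 1": "1",
}

# L = left-to-right, R = right-to-left, AL = Arabic letter (right-to-left),
# EN = European number (digits keep running left to right).
bidi_classes = {name: unicodedata.bidirectional(ch)
                for name, ch in samples.items()}
print(bidi_classes)
```

The digit's class is what makes Hebrew and Arabic text inherently bidirectional: numbers embedded in a right-to-left line still run left to right.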
48
http://en.wikipedia.org/wiki/Help:Arabic
49
Sequence of Base Characters and
Diacritics
The sequence of Unicode characters U+0061 “a” + U+0308 (combining diaeresis)
+ U+0075 “u” unambiguously encodes “äu”, not “aü”.
A combining mark always applies to the base character that precedes it.
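This can be confirmed with Unicode normalization. A minimal sketch assuming Python 3 and the standard `unicodedata` module: NFC composition merges the combining diaeresis with the base character before it, producing ä followed by a plain u.

```python
import unicodedata

decomposed = "a\u0308u"  # a + combining diaeresis + u
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 3 2
print(composed == "\u00e4u")           # True: ä followed by u
print(composed == "a\u00fc")           # False: not a followed by ü
```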
50
51
52
53
Unicode Bidirectional Algorithm
http://unicode.org/reports/tr9/
54
55
我 - U+6211
愛 - U+611B
你 - U+4F60
http://blog.163.com/guoo1230@126/blog/static/321155112011328102542586/
Why?
U+0000 to U+007F: Basic Latin
U+0370 to U+03FF: Greek and Coptic
U+1400 to U+167F: Unified Canadian Aboriginal Syllabics
56
U+4E00 to U+9FFF: CJK Unified Ideographs
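• One advantage of the UTF encodings is that, although
characters take a varying number of bytes, you do not have
to scan from the start of the text to locate characters
correctly, as you do with GB2312/GBK. Under UTF, a fairly
fixed algorithm tells you from the current position whether
the current byte starts a code point or belongs to the middle
or end of one, so locating characters is relatively simple.
The simplest case is still UTF-32, which needs no character
locating at all, but the size grows considerably in exchange.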
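For UTF-8, the "fixed algorithm" mentioned above comes down to one bit test: continuation bytes always have the form 10xxxxxx. The sketch below assumes Python 3, and the helper name `char_start` is hypothetical. It steps back from an arbitrary byte offset to the first byte of the character containing it.

```python
def char_start(data, pos):
    """Return the index of the first byte of the UTF-8 character at pos."""
    # Continuation bytes match 10xxxxxx; lead and ASCII bytes do not.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "我愛你".encode("utf-8")  # three characters, three bytes each
start = char_start(data, 4)      # offset 4 is inside the second character
print(start, data[start:start + 3].decode("utf-8"))
```

No such test exists for Big5 or GBK, where a trail byte can look exactly like a lead byte or an ASCII character, so those encodings have to be scanned from the start.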
57