Chinese Character Encoding

101035 Chinese Information Processing
Chinese NLP
Lecture 2

Chinese Character Encoding
• Chinese character sets
• Chinese code sets
• Basic encoding methods
• Chinese encoding methods
• International encoding methods

Chinese Character Set
• Character Set
  • A character set is a collection of characters.
  • {a, b, c, …, z, A, B, C, …, Z, 0, 1, 2, …, 9} is an English character set;
    {啊, 阿, 唉, …, 作, 坐, 座} is a Chinese character set.
  • Each character set has a name, such as ASCII or KANG XI (康熙).
  • There is more than one Chinese character set; they vary over time and across regions.

• Chinese Character Set
  • GB
    • GB is short for 国家标准 (Guojia Biaozhun) and means National Standard.
    • The GB character sets were developed in Mainland China and are based on simplified Chinese characters.
    • Countries such as China and Singapore use this standard.
  • Big5
    • Big5 is the most widely implemented character set standard in Taiwan and is used for traditional Chinese characters.
    • Regions such as Taiwan and Hong Kong use this standard.
  • ISO 10646-1 and Unicode
    • ISO and the Unicode Consortium jointly develop a multilingual character set that combines the majority of the world’s character sets into one large repertoire of characters.
    • Simplified and traditional Chinese, Korean and Japanese characters can be displayed on the same HTML page.

Chinese Code Set
• Code Set
  • Code set means “coded character set”.
  • Encoding a character set means representing its characters in bytes or bits.
  • The complete set of numerical values is called the code space (denoted by CODE).
  • A value in the code space is called a code (or a code point).
  • An encoding is a mapping of each (unique) character in a character set to a (unique) code in a code space.

• Chinese Code Set
  • A coded character set, denoted by CC, is a set of tuples, $CC = \{(c_i, code_i) \mid c_i \in C,\ code_i \in CODE\}$, where $code_i \neq code_j$ if $c_i \neq c_j$.
  • For example, with C = {中, 文, 计, 算}, C can be encoded with different code spaces.
  • If CODE1 = {00, 01, 10, 11}, CC1 = {(中, 00), (文, 01), (计, 10), (算, 11)}.
  • If CODE2 = {0000, 0001, 0010, 0011}, CC2 = {(中, 0000), (文, 0001), (计, 0010), (算, 0011)}.
  • If CODE3 = {1000, 1001, 1001, 1011}, CC3 = {(中, 1000), (文, 1001), (计, 1001), (算, 1011)}.
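
The uniqueness condition in the definition can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the helper name `is_valid_coded_set` is hypothetical. It stores each coded character set from the example as a list of (character, code) tuples and tests that distinct characters receive distinct codes. Note that CC3, as printed, assigns 1001 to both 文 and 计, so it does not pass the check.

```python
def is_valid_coded_set(cc):
    """Return True if every character and every code appears exactly once."""
    chars = [char for char, _ in cc]
    codes = [code for _, code in cc]
    return len(set(chars)) == len(chars) and len(set(codes)) == len(codes)


# The three coded character sets exactly as given in the example above.
CC1 = [("中", "00"), ("文", "01"), ("计", "10"), ("算", "11")]
CC2 = [("中", "0000"), ("文", "0001"), ("计", "0010"), ("算", "0011")]
CC3 = [("中", "1000"), ("文", "1001"), ("计", "1001"), ("算", "1011")]

results = {
    name: is_valid_coded_set(cc)
    for name, cc in [("CC1", CC1), ("CC2", CC2), ("CC3", CC3)]
}
print(results)
```

CC1 and CC2 differ only in the width of the code space (2 bits versus 4 bits); both satisfy the condition.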

In-Class Exercise
• What binary values can be assigned to these 6 characters according to this code space?
  (Tip: first, how many bits do you need to encode 6 rows and 6 columns?)
[Figure: Two-Dimensional Code Space (6×6), rows 1-6 by columns 1-6; 啊, 阿, 唉 lie in row 1 and 作, 坐, 座 in a later row]

Basic Encoding Method
• The mapping of a character in a character set to a code point in a code space is called code point assignment.
• An encoding method specifies how a character is mapped to a code point, and how assignments are made so that a mixture of different code sets can be told apart.
[Figure: code sets CC1, CC2, CC3 and CC4]

• ASCII
  • A popular encoding scheme for English characters is ASCII (American Standard Code for Information Interchange).
  • It defines 128 character code points (0x00 to 0x7F): the first 32 (0x00 to 0x1F) are non-printable control codes, and the other 96 (0x20 to 0x7F) are graphic characters.
  • Of those 96, only 94 are actually printable (0x21-0x7E).
  • Values are represented with only 7 bits (the first bit of the byte is 0).
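
The counts quoted for ASCII follow directly from the hexadecimal ranges. A small Python sketch (variable names are illustrative, not from the slides) that recomputes them and confirms that every ASCII code leaves the high bit of the byte clear:

```python
control = range(0x00, 0x20)    # 0x00-0x1F, control codes
graphic = range(0x20, 0x80)    # 0x20-0x7F, graphic characters
printable = range(0x21, 0x7F)  # 0x21-0x7E, actually printable

counts = (len(control), len(graphic), len(printable))

# A 7-bit value never sets bit 7 (mask 0x80) of the byte that holds it.
all_seven_bit = all((code & 0x80) == 0 for code in range(0x00, 0x80))

print(counts, all_seven_bit)
```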

• ASCII
[Table: ASCII code chart, with the high bits 0000-0111 as rows and the low bits 0000-1111 as columns]

Chinese Encoding Method
• It takes 2 bytes, or 16 bits, to encode all the Chinese characters.
• However, not all of these 256×256 code points are used to represent displayable characters. Generally, a 94×94 matrix is used for Chinese character encoding.

• One-Byte English Characters vs. Two-Byte Chinese Characters
• High-Bit-On Scheme
  • English: 0xxxxxxx
    Code range is 33-126 (<128) or 0x21-0x7E.
  • Chinese: 1xxxxxxx xxxxxxxx
    Code range of the first byte is 161-254 (>128) or 0xA1-0xFE.

• Chinese Characters or English Characters?
  Byte stream: AB AC 41 42 43 A4 40
  • 0xAB = 171 > 128, so AB AC is a Chinese character.
  • 0x41 = 65 < 128, so 41 is an English character.
  • 0x42 = 66 < 128, so 42 is an English character.
  • 0x43 = 67 < 128, so 43 is an English character.
  • 0xA4 = 164 > 128, so A4 40 is a Chinese character.
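
The high-bit-on rule can be written as a short scanning loop. The following Python sketch is an illustration only (the function name `split_high_bit_on` is hypothetical): a byte below 128 is taken as a one-byte English character, and a byte of 128 or above is taken as the first byte of a two-byte Chinese character.

```python
def split_high_bit_on(data: bytes):
    """Split a mixed byte stream into one-byte and two-byte characters."""
    tokens = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:                      # high bit on: two-byte character
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            tokens.append(("Chinese", data[i:i + 2].hex().upper()))
            i += 2
        else:                                    # high bit off: one-byte character
            tokens.append(("English", data[i:i + 1].hex().upper()))
            i += 1
    return tokens


tokens = split_high_bit_on(bytes.fromhex("AB AC 41 42 43 A4 40"))
for kind, code in tokens:
    print(kind, code)
```

Only the first byte decides the length, which is why the final 0x40 (below 128) is still read as the second half of A4 40.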

• Two encoding methods are common to many character sets in China, Taiwan and other parts of Asia:
  • ISO-2022 and EUC
• They are locale-independent encoding methods.
• However, their exact definitions depend greatly on the locale. In other words, there are locale-specific instances of these encodings, e.g.
  • ISO-2022-CN, ISO-2022-CN-EXT, …
  • EUC-CN, EUC-TW, …

• Encoding method and character set

  Encoding Method   Supported Character Sets
  ASCII             ASCII, GB-Roman, CNS-Roman, …
  ISO-2022          ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992, …
  EUC               ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992, …

  ASCII Encoding        Hex Range   Decimal Range
  Control characters    00-1F       0-31
  Graphic characters    21-7E       33-126 (94 printable characters)
  Space character       20          32
  Delete character      7F          127

• ISO-2022
  • ISO-2022 is a modal encoding: it uses escape sequences or other special characters to switch between modes (one-byte vs. two-byte).
  • It is used primarily as an information interchange code for moving text between computer systems, for example in email.
  • It is often referred to as a seven-bit encoding method, because none of the bytes used to represent characters has its eighth bit enabled.

• ISO-2022-CN (-EXT)
  • ISO-2022-CN (-EXT) is a locale-specific implementation of ISO-2022.
  • It is achieved through the use of designators and shifts.
    • A designator specifies the character set associated with a particular shift.
    • There are four shifts: SI, SO, SS2 and SS3. A shift specifies how to interpret the subsequent bytes.
    • Each line starts in ASCII and ends in ASCII.
    • A shifting character, SO (0x0E) or SI (0x0F), switches between one-byte and two-byte modes.
    • SO (Shift Out) invokes two-byte mode (for GB 2312-80 and CNS 11643-1992 Plane 1) for the following bytes until SI (Shift In) is encountered, which invokes one-byte mode again.
    • There must be a shift back to ASCII (by SI) before the end of the line.

• ISO-2022-CN (-EXT)
  • A single-shift sequence, SS2 (0x1B 0x4E) or SS3 (0x1B 0x4F), invokes two-byte mode for the following two bytes only, and is typically employed for rarely used character sets.

    Shifting Type   Invoked Character Sets
    SO              GB 2312-80, CNS 11643-1992 Plane 1
    SS2             CNS 11643-1992 Plane 2
    SS3             CNS 11643-1992 Planes 3-7

  • A designator (escape) sequence indicates which character set should be invoked when in two-byte mode, e.g.,
    • 0x1B 0x24 0x29 0x41 (<ESC> $ ) A in ASCII) indicates GB 2312-80.

• Designator and shift
  Byte stream: 1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F

  Bytes          Meaning
  1B 24 29 47    Designate CNS 11643 Plane 1
  31 30          ASCII code (one-byte mode)
  0E             SO: shift to two-byte mode
  45 4C          CNS Plane 1 code
  0F             SI: shift to one-byte mode
  31 38          ASCII code
  0E             SO: shift to two-byte mode
  45 4A          CNS Plane 1 code
  0F             SI: shift to one-byte mode
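
The designator-and-shift walk-through can be reproduced with a small state machine. This is a simplified Python sketch under stated assumptions, not a complete ISO-2022-CN decoder: it only knows the two SO designators that appear in these slides, it ignores SS2 and SS3, and the names `tokenize` and `SO_DESIGNATORS` are hypothetical. It labels each part of the byte stream rather than converting the codes to characters.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Designator sequences for the SO set. The GB 2312-80 entry is quoted in the
# slides; the CNS Plane 1 entry is taken from the example byte stream.
SO_DESIGNATORS = {
    bytes([ESC, 0x24, 0x29, 0x41]): "GB 2312-80",
    bytes([ESC, 0x24, 0x29, 0x47]): "CNS 11643 Plane 1",
}


def tokenize(data: bytes):
    """Split an ISO-2022-CN style byte stream into (kind, value) tokens."""
    tokens = []
    designated = None   # character set currently assigned to SO
    two_byte = False    # False: one-byte (ASCII) mode, True: two-byte mode
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            designated = SO_DESIGNATORS[data[i:i + 4]]  # KeyError if unknown
            tokens.append(("designate", designated))
            i += 4
        elif byte == SO:
            two_byte = True
            tokens.append(("shift", "SO"))
            i += 1
        elif byte == SI:
            two_byte = False
            tokens.append(("shift", "SI"))
            i += 1
        elif two_byte:
            tokens.append((designated, data[i:i + 2].hex().upper()))
            i += 2
        else:
            tokens.append(("ASCII", chr(byte)))
            i += 1
    return tokens


stream = bytes.fromhex("1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F")
tokens = tokenize(stream)
for token in tokens:
    print(token)
```

Every byte in the stream is below 0x80, which is the seven-bit property of ISO-2022; the mode, not the high bit, decides whether a byte is half of a two-byte code.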

• EUC
  • EUC (Extended Unix Code) encoding is implemented as the internal code for most Unix software configured to support Japanese.
  • Although the U stands for Unix, this encoding is also commonly used on other platforms, such as Windows and Mac OS.
  • The full definition of EUC encoding consists of four code sets.
    • Code set 0 is always set to the ASCII character set or a country’s own version thereof.
    • The remaining code sets are defined as a set of variants from which each country can select.

• EUC-CN
  • EUC-CN is a locale-specific implementation of EUC.

    Code Set      Byte           Hex Range   Decimal Range   Values
    Code set 0    Byte           21-7E       33-126
    Code set 1    First byte     A1-FE       161-254         94
                  Second byte    A1-FE       161-254         94

  • EUC-CN (GB): 1xxxxxxx 1xxxxxxx
    Code range of both the first byte and the second byte is 161-254 (>128) or 0xA1-0xFE.

• EUC-TW
  • EUC-TW is by far the most complex implementation of EUC encoding in terms of the number of characters it encodes: close to 50,000.

• ISO-2022 vs EUC
  • EUC encoding is closely related to ISO-2022.
  • In fact, every character that can be encoded by ISO-2022 can be converted to an EUC-encoded equivalent.

    Locale (Character Set)     Character   ISO-2022-CN   EUC-CN or EUC-TW Set 1
    China (GB 2312-80)         汉          3A3A          BABA
                               字          5756          D7D6
    Taiwan (CNS 11643-1992)    漢          6947          E9C7
                               字          4773          C7F3
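
In the four code pairs tabulated above, each EUC value equals the ISO-2022 value with the high bit of both bytes switched on. The Python sketch below checks that observation; the function name `iso2022_to_euc` is hypothetical, and the rule is inferred from these four table rows rather than stated on the slide. It also cross-checks the GB 2312-80 row against Python's built-in `gb2312` codec, which produces EUC-CN bytes.

```python
def iso2022_to_euc(code: int) -> int:
    """Set the high bit of both bytes of a 7-bit two-byte code."""
    return code | 0x8080


# (ISO-2022-CN code, EUC code) pairs copied from the table.
pairs = {0x3A3A: 0xBABA, 0x5756: 0xD7D6, 0x6947: 0xE9C7, 0x4773: 0xC7F3}
all_match = all(iso2022_to_euc(iso) == euc for iso, euc in pairs.items())

# Cross-check the first row with the standard-library codec.
euc_cn_hex = "汉字".encode("gb2312").hex().upper()

print(all_match, euc_cn_hex)
```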

• GBK
  • GBK encoding is implemented as the internal code for the Chinese (PRC) versions of Microsoft’s Windows and IBM’s OS/2.
  • The GBK character set contains 21,886 symbols and Chinese characters.

    Encoding             Byte           Hex Range       Decimal Range
    ASCII or GB-Roman    Byte           21-7E           33-126
    GBK                  First byte     81-FE           129-254
                         Second byte    40-7E, 80-FE    64-126, 128-254

• GBK
  • One of the design principles of GBK is that it should be fully compatible with GB 2312 and extend it to cover Unicode, which had 20,902 Chinese (Han) characters in its first version.

• Big5
  • The Big5 encoding range has a lot in common with EUC-TW code sets 0 and 1; the main difference is that there is an additional encoding block.

    Encoding              Byte           Hex Range       Decimal Range
    ASCII or CNS-Roman    Byte           21-7E           33-126
    Big5                  First byte     A1-FE           161-254
                          Second byte    40-7E, A1-FE    64-126, 161-254
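
The byte ranges given for EUC-CN, GBK and Big5 can be compared side by side. The Python sketch below (names such as `RANGES` and `fits` are hypothetical) tests whether a two-byte value falls inside each encoding's first-byte and second-byte ranges. It checks ranges only; whether a character is actually assigned at that position is a separate question.

```python
# (low, high) ranges for the first and second byte, copied from the slides.
RANGES = {
    "EUC-CN": {"first": [(0xA1, 0xFE)], "second": [(0xA1, 0xFE)]},
    "GBK":    {"first": [(0x81, 0xFE)], "second": [(0x40, 0x7E), (0x80, 0xFE)]},
    "Big5":   {"first": [(0xA1, 0xFE)], "second": [(0x40, 0x7E), (0xA1, 0xFE)]},
}


def in_ranges(value: int, ranges) -> bool:
    """True if value lies in any of the inclusive (low, high) ranges."""
    return any(low <= value <= high for low, high in ranges)


def fits(encoding: str, first: int, second: int) -> bool:
    """True if the byte pair lies inside the encoding's two-byte ranges."""
    spec = RANGES[encoding]
    return in_ranges(first, spec["first"]) and in_ranges(second, spec["second"])


# The pair A4 40 from the earlier byte-stream example.
result = {name: fits(name, 0xA4, 0x40) for name in RANGES}
print(result)
```

A second byte of 0x40 is below 128, so the pair cannot be EUC-CN, but it is inside the extra second-byte block that GBK and Big5 both allow.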

International Encoding Method
• Unicode and ISO 10646-1
  • We need a multilingual character set that combines the majority of the world’s writing systems and character sets into a Universal Character Set (UCS), or Unicode.

    Character Set              Encoding Methods
    Unicode and ISO 10646-1    UCS-2, UCS-4, UTF-7, UTF-8, UTF-16

• BMP
  • The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.

• UCS-2
  • A 16-bit representation can encode up to 65,536 (= $2^{16}$) unique code points.
  • It allocates the entire encoding space to characters (0x0000-0xFFFF).

    UCS-2          Hex Range   Decimal Range
    First byte     00-FF       0-255
    Second byte    00-FF       0-255

  • UCS-2 and Unicode encodings are identical for most Chinese characters.

• UCS-4
  • It is a four-byte (actually, 31-bit) representation which can encode up to 2,147,483,648 (= $2^{31}$) code points.
  • It allocates the entire encoding space to characters (0x00000000-0x7FFFFFFF).

    UCS-4          Hex Range   Decimal Range
    First byte     00-7F       0-127
    Second byte    00-FF       0-255
    Third byte     00-FF       0-255
    Fourth byte    00-FF       0-255

• UCS-2 vs UCS-4
  • UCS-2 (16-bit): 0000 … FFFF, 65,536 code points. Can only encode the BMP.
  • Unicode (17 planes): 17 × 256 × 256 = 1,114,112 characters.
  • UCS-4 (31-bit): 00000000 … 0000FFFF, 00010000 … 7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.
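
The three sizes in this comparison are plain powers and products, which a few lines of Python can confirm (variable names are illustrative):

```python
ucs2_points = 2 ** 16            # 16-bit code space
unicode_points = 17 * 256 * 256  # 17 planes of 256 x 256 positions
ucs4_points = 2 ** 31            # 31-bit code space

# The last position of plane 16 is one less than the Unicode total.
last_scalar = unicode_points - 1

print(f"{ucs2_points:,} {unicode_points:,} {ucs4_points:,} {last_scalar:X}")
```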

• UTF-16
  • In essence, UTF-16 encodes the BMP exactly as UCS-2 (16-bit) encoding does, so the two are compatible.
  • But it also gives access to the next 16 planes, which are normally only accessible through UCS-4 (32-bit) encoding.
  • The surrogates area is defined in UTF-16 to allow for expansion beyond the 16-bit code space.

• UCS-2 vs UCS-4 vs UTF-16
  • UCS-2: 0000 … FFFF, 65,536 code points. Can only encode the BMP.
  • UTF-16: surrogate pairs from D800 DC00 to DBFF DFFF reach Unicode planes 1 to 16, i.e. scalar values U+10000 … U+10FFFF (a scalar value is denoted by U+).
  • Unicode: Plane 0, Plane 1, …, Plane 16.
  • UCS-4: 00000000 … 0000FFFF, 00010000 … 0010FFFF, … 7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.
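
The slides give the end points of the surrogate area but not the arithmetic that maps a scalar value to a surrogate pair. The Python sketch below follows the standard UTF-16 definition (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function name `to_surrogates` is hypothetical. It reproduces the end points D800 DC00 and DBFF DFFF and cross-checks one value against Python's own `utf-16-be` codec.

```python
def to_surrogates(scalar: int):
    """Map a scalar value in planes 1-16 to a (high, low) surrogate pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value is not in planes 1-16")
    offset = scalar - 0x10000          # 20-bit offset into planes 1-16
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


first = to_surrogates(0x10000)
last = to_surrogates(0x10FFFF)

# Cross-check against the standard-library encoder.
codec_hex = chr(0x10000).encode("utf-16-be").hex().upper()

print([f"{h:04X} {l:04X}" for h, l in (first, last)], codec_hex)
```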

• Base64
  • Base64 uses 64 characters: the upper-case and lower-case Roman alphabet characters (A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols.

• Base64
  • Step 1: Base64 takes every three bytes (eight bits each) and converts them into four six-bit segments.
  • Step 2: Each six-bit segment is then converted into a character in the Base64 character set.
  • Step 3: If the size of the original data in bytes is not a multiple of three, enough bytes with a value of “0” are appended to complete a 3-byte group. Each six-bit segment that consists only of appended bits is written as the Base64 padding character, “=”.
  • Example 1: 00100101 10110100 01101001 → 001001 011011 010001 101001 → JbRp
  • Example 2: 01000001 (+ 00000000 00000000) → 010000 010000 000000 000000 → QQ==
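
The three steps translate almost line by line into code. The following Python sketch is a teaching illustration (the function name `b64_manual` is hypothetical); in practice the standard library's `base64.b64encode` does the same job, and the sketch is compared against it on the two byte strings from the worked examples.

```python
import base64
import string

# The 64-character Base64 alphabet: A-Z, a-z, 0-9, "+", "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64_manual(data: bytes) -> str:
    """Encode bytes as Base64 by following the three steps literally."""
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        missing = 3 - len(group)
        group += b"\x00" * missing                  # Step 3: pad with zero bytes
        bits = int.from_bytes(group, "big")         # 24 bits as one integer
        sextets = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]  # Step 1
        chars = [ALPHABET[s] for s in sextets]      # Step 2
        if missing:
            chars[4 - missing:] = ["="] * missing   # all-padding sextets become "="
        out.append("".join(chars))
    return "".join(out)


example1 = b64_manual(bytes([0b00100101, 0b10110100, 0b01101001]))
example2 = b64_manual(bytes([0b01000001]))
print(example1, example2)
```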

In-Class Exercise
• What is the result of applying Base64 to the three characters with hex codes BEAE, CED3 and B7F5 (a Japanese name, 小林剑)?
  (Please first convert the hex to binary.)

• UTF-7
  • UTF-7 uses the same character set as Base64.
  • UTF-7 differs from Base64 in that:
    • The “padding” character is not necessary.
    • The Base64-like transformation is applied only to specific characters.
    • A run of characters that requires Base64 transformation under UTF-7 begins with a “plus” character (+, 0x2B) and ends with a “hyphen” character (-, 0x2D).

    Character String    M         y         (space)    河 豚
    UCS-2 Encoding      004D      0079      0020       6CB3 8C5A
    UTF-7 Encoding      M (4D)    y (79)    (20)       +bLOMWg-
                        ASCII codes                    Base64 run
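
The UTF-7 example can be rebuilt by hand from the rules listed for it. In this Python sketch (variable names are illustrative), the two Chinese characters are encoded as 16-bit big-endian values, passed through Base64, stripped of padding, and wrapped in “+” and “-”. Python's built-in `utf-7` codec is used as a cross-check.

```python
import base64

text = "My 河豚"

# UCS-2 view of the string: two bytes per character, big-endian.
ucs2_hex = text.encode("utf-16-be").hex().upper()

# Base64 over the 16-bit codes of the non-ASCII run, without "=" padding.
run = base64.b64encode("河豚".encode("utf-16-be")).decode("ascii").rstrip("=")
manual = "My " + "+" + run + "-"

# The same string through the standard-library codec.
codec = text.encode("utf-7").decode("ascii")

print(ucs2_hex, manual, codec)
```

The ASCII characters M, y and the space pass through unchanged, so only the run for 河豚 grows, from four bytes to eight characters.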

• UTF-8
  • UTF-8 encoding was developed as a way to represent Unicode text as a stream of one or more 8-bit units, rather than as true 16-bit units.
    • It converts UCS-2 into a mixed one- through three-byte encoding.
    • It converts UCS-4 into a mixed one- through six-byte encoding.
    • It converts UTF-16 into a mixed one- through four-byte encoding.
    • It is therefore an eight-bit, variable-length encoding.
  • UTF-8 is the de facto standard encoding for interchange of Unicode text.

• UTF-8
  • Encoding Templates
    • For all but the ASCII-compatible range, the number of first-byte high-order bits set to “1” indicates the byte length.
    • Fill the templates starting from the rightmost bits.

    UCS-2 Range                UTF-8 Bit Arrays
    0000-007F                  0xxxxxxx
    0080-07FF                  110xxxxx 10xxxxxx
    0800-FFFF                  1110xxxx 10xxxxxx 10xxxxxx

    Unicode Range (+)          UTF-8 Bit Arrays
    0001 0000 - 001F FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    UCS-4 Range (+)            UTF-8 Bit Arrays
    0020 0000 - 03FF FFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0400 0000 - 7FFF FFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

• UTF-8 Example: convert the Unicode character “茶” into UTF-8 code.
  • Step 1: The Unicode value of 茶 (tea) is 8336 (hexadecimal), which lies in the range 0800-FFFF, so we need 3 bytes.
  • Step 2: The binary form of hexadecimal 8336 is 1000 001100 110110 (grouped as 4 + 6 + 6 bits).
  • Step 3: Fill the empty slots of the three-byte template with this binary value:
    1110xxxx 10xxxxxx 10xxxxxx
    11101000 10001100 10110110
  • Step 4: The UTF-8 code value is thus E8 8C B6.
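
The template-filling procedure for the BMP can be sketched in a few lines of Python. The function name `utf8_encode_bmp` is hypothetical and the sketch covers only the one- to three-byte templates (0000-FFFF); Python's built-in encoder is used to confirm the result for 茶.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a BMP code point by filling the 1-, 2- or 3-byte template."""
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point is outside the BMP")


tea = utf8_encode_bmp(0x8336)
tea_hex = tea.hex(" ").upper()
tea_bits = " ".join(f"{byte:08b}" for byte in tea)
print(tea_hex, tea_bits)
```

The shifts by 12 and 6 pick out the same 4 + 6 + 6 bit groups as Step 2, and the OR with 0xE0 or 0x80 supplies the fixed template bits.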

Wrap-Up
• Chinese character sets
• Chinese code sets
• Basic encoding methods
  • ASCII
• Chinese encoding methods
  • ISO-2022
  • EUC
  • GBK
  • Big5
• International encoding methods
  • Unicode
  • ISO 10646-1
  • UCS-2
  • UCS-4
  • UTF-16
  • Base64
  • UTF-7
  • UTF-8