101035 中文信息处理 Chinese NLP
Lecture 2

字:中文编码 Chinese Character Encoding
• 中文字符集 (Character Set)
• 中文编码集 (Code Set)
• 基本编码方式 (Basic Encoding Method)
• 中文编码方式 (Chinese Encoding Method)
• 国际编码方式 (International Encoding Method)

中文字符集 Chinese Character Set
• Character Set (字符集)
  • A character set is a collection of characters.
  • {a, b, c, …, z, A, B, C, …, Z, 0, 1, 2, …, 9} is an English character set; {啊, 阿, 唉, …, 作, 坐, 座} is a Chinese character set.
  • Each character set has a name, such as ASCII or KANG XI (康熙).
  • There is more than one Chinese character set; they differ over time and across regions.
• Chinese Character Set
  • GB
    • GB is short for 国家标准 (Guojia Biaozhun), which means National Standard.
    • The GB character sets are developed in Mainland China and are based on simplified Chinese characters.
    • Countries such as China and Singapore use this standard.
  • Big5
    • Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters.
    • Regions such as Taiwan and Hong Kong use this standard.
  • ISO 10646-1 and Unicode
    • ISO and the Unicode Consortium jointly develop a multilingual character set that combines the majority of the world's character sets into one large repertoire of characters.
    • Simplified/Traditional Chinese, Korean and Japanese characters can be displayed on the same HTML page.

中文编码集 Chinese Code Set
• Code Set (编码集)
  • Code set means "coded character set".
  • Encoding a character set means representing its characters in bytes or bits.
  • The complete set of numerical values is called the code space (denoted by CODE).
  • A value in the code space is called a code (or a code point).
  • Encoding is a mapping of a (unique) character (in a character set) to a (unique) code (in a code space).
• Chinese Code Set
  • A coded character set, denoted by CC, is a set of tuples, $CC = \{(c_i, code_i) \mid c_i \in C,\ code_i \in CODE\}$, where $code_i \neq code_j$ if $c_i \neq c_j$.
  • For example, C = {中, 文, 计, 算}. C can be encoded with different code spaces.
  • If CODE1 = {00, 01, 10, 11}, CC1 = {(中, 00), (文, 01), (计, 10), (算, 11)}.
  • If CODE2 = {0000, 0001, 0010, 0011}, CC2 = {(中, 0000), (文, 0001), (计, 0010), (算, 0011)}.
  • If CODE3 = {1000, 1001, 1010, 1011}, CC3 = {(中, 1000), (文, 1001), (计, 1010), (算, 1011)}.

In-Class Exercise
• What binary values can be assigned to these 6 characters according to this code space? (Tip: first work out how many bits you need to encode 6 rows and 6 columns.)
[Figure: Two-Dimensional Code Space (6×6) with rows and columns numbered 1 to 6, containing the characters 啊, 阿, 唉 and 作, 坐, 座]

基本编码方式 Basic Encoding Method
• The mapping of a character in a character set to a code point in a code space is called code point assignment.
• An encoding method explains how a character is mapped to a code point, and also how assignments are made so that a mixture of different code sets can be told apart.
[Figure: a mixture of code sets CC1, CC2, CC3, CC4]
• ASCII
  • A popular encoding scheme for English characters is ASCII (American Standard Code for Information Interchange).
  • It defines 128 code points (0x00 to 0x7F). The first 32 (0x00 to 0x1F) are non-printable control codes, and the other 96 (0x20 to 0x7F) are nominally graphic characters.
  • Of these 96, only 94 (0x21 to 0x7E) are actually printable, because 0x20 is the space character and 0x7F is the delete character.
  • Values are represented with only 7 bits (in an 8-bit byte, the first bit is 0).
[Table: ASCII code chart, indexed by high bits (0000 to 0111) and low bits (0000 to 1111)]

中文编码方式 Chinese Encoding Method
• It takes 2 bytes, or 16 bits, to encode all the Chinese characters.
• However, not all of these 256×256 points are used for representing displayable characters. Generally, a 94×94 matrix is used for Chinese character encoding.
• One-Byte English Characters vs. Two-Byte Chinese Characters
  • High-Bit-On Scheme
    • English: 0xxxxxxx. The code range is 33-126 (<128), or 0x21-0x7E.
    • Chinese: 1xxxxxxx xxxxxxxx. The code range of the first byte is 161-254 (>128), or 0xA1-0xFE.
• Chinese Characters or English Characters?
  Byte stream: AB AC 41 42 43 A4 40
  • 0xAB = 171 > 128, so ABAC is a Chinese character.
  • 0x41 = 65 < 128, so 41 is an English character.
  • 0x42 = 66 < 128, so 42 is an English character.
  • 0x43 = 67 < 128, so 43 is an English character.
  • 0xA4 = 164 > 128, so A440 is a Chinese character.

• Two encoding methods are common to many character sets in China, Taiwan and other Asian countries: ISO-2022 and EUC.
  • They are locale-independent encoding methods.
  • However, their exact definitions depend greatly on the locale. In other words, there are locale-specific instances of these encodings, e.g.
    • ISO-2022-CN, ISO-2022-CN-EXT, …
    • EUC-CN, EUC-TW, …
• Encoding method and character set

  Encoding Method   Supported Character Sets
  ASCII             ASCII, GB-Roman, CNS-Roman …
  ISO-2022          ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …
  EUC               ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …

  ASCII Encoding Range   Hex     Decimal
  Control characters     00-1F   0-31
  Graphic characters     21-7E   33-126   (94 printable characters)
  Space character        20      32
  Delete character       7F      127

• ISO-2022
  • ISO-2022 is a modal encoding: it uses escape sequences or other special characters to switch between different modes (one-byte vs. two-byte).
  • It is used primarily as an information interchange code for moving text between computer systems, for example in email.
  • It is also often referred to as a seven-bit encoding method, because none of the bytes used to represent characters has its eighth bit set.
• ISO-2022-CN (-EXT)
  • ISO-2022-CN (-EXT) is a locale-specific implementation of ISO-2022.
  • It is achieved through the use of designators and shifts.
    • A designator specifies the character set associated with a particular shift.
    • There are four shifts: SI, SO, SS2 and SS3.
    • A shift specifies how to interpret the subsequent bytes.
    • Each line starts in ASCII and ends in ASCII.
    • A shift character, SO (0x0E) or SI (0x0F), switches between one-byte and two-byte modes.
    • SO (Shift Out) invokes two-byte mode (for GB 2312-80 and CNS 11643-1992 Plane 1) for the following bytes until SI (Shift In) is encountered, which invokes one-byte mode.
    • There must be a shift back to ASCII (by SI) before the end of the line.
• ISO-2022-CN (-EXT)
  • A single shift sequence, SS2 (0x1B 0x4E) or SS3 (0x1B 0x4F), invokes two-byte mode only for the following two bytes, and is typically used for rarely-used character sets.

  Shift   Invoked Character Sets
  SO      GB 2312-80, CNS 11643-1992 Plane 1
  SS2     CNS 11643-1992 Plane 2
  SS3     CNS 11643-1992 Planes 3-7

  • A designator (escape) sequence indicates which character set is invoked in two-byte mode, e.g.
    • 0x1B 0x24 0x29 0x41 (<ESC> $ ) A in ASCII) indicates GB 2312-80.
• Designator and shift

  Bytes         Meaning
  1B 24 29 47   Designate CNS 11643 Plane 1 (still in one-byte mode)
  31 30         ASCII code
  0E            SO: shift to two-byte mode
  45 4C         CNS Plane 1 code
  0F            SI: shift to one-byte mode
  31 38         ASCII code
  0E            SO: shift to two-byte mode
  45 4A         CNS Plane 1 code
  0F            SI: shift to one-byte mode

• EUC
  • EUC (Extended Unix Code) encoding is implemented as the internal code for most Unix software configured to support Japanese.
  • Although the U stands for Unix, this encoding is commonly used on other platforms, such as Windows and Mac OS.
  • The full definition of EUC encoding consists of four code sets.
    • Code set 0 is always set to the ASCII character set or a country's own version thereof.
    • The remaining code sets are defined as a set of variants from which each country can select.
• EUC-CN
  • EUC-CN is a locale-specific implementation of EUC.

  Code set     Byte                Hex     Decimal
  Code set 0   Byte range          21-7E   33-126
  Code set 1   First byte range    A1-FE   161-254
               Second byte range   A1-FE   161-254   (94 × 94)

  • EUC-CN (GB): 1xxxxxxx 1xxxxxxx. The code range of both the first byte and the second byte is 161-254 (>128), or 0xA1-0xFE.
• EUC-TW
  • EUC-TW is by far the most complex implementation of EUC encoding in terms of how many characters it encodes, i.e.
    close to 50,000 characters.
• ISO-2022 vs EUC
  • EUC encoding is closely related to ISO-2022. In fact, every character that can be encoded by ISO-2022 can be converted to an EUC-encoded equivalent.

  Locale (Character Set)    Character   ISO-2022-CN   EUC-CN or EUC-TW code set 1
  China (GB 2312-80)        汉          3A3A          BABA
                            字          5756          D7D6
  Taiwan (CNS 11643-1992)   漢          6947          E9C7
                            字          4773          C7F3

• GBK
  • GBK encoding is implemented as the internal code for the Chinese (PRC) version of Microsoft's Windows and IBM's OS/2.
  • The GBK character set contains 21,886 symbols and Chinese characters.

  Code set             Byte                 Hex            Decimal
  ASCII or GB-Roman    Byte range           21-7E          33-126
  GBK                  First byte range     81-FE          129-254
                       Second byte ranges   40-7E, 80-FE   64-126, 128-254

  • One of the design principles of GBK is that it should be fully compatible with GB 2312 and extend it to cover Unicode, whose first version contained 20,902 Chinese characters.
• Big5
  • The Big5 encoding range has a lot in common with EUC-TW code sets 0 and 1; the main difference is that there is an additional encoding block.

  Code set             Byte                 Hex            Decimal
  ASCII or CNS-Roman   Byte range           21-7E          33-126
  Big5                 First byte range     A1-FE          161-254
                       Second byte ranges   40-7E, A1-FE   64-126, 161-254

  • Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters.

国际编码方式 International Encoding Method
• Unicode and ISO 10646-1
  • The goal is a multilingual character set that combines the majority of the world's writing systems and character sets into a Universal Character Set (UCS), or Unicode.

  Character Set             Encoding Methods
  Unicode and ISO 10646-1   UCS-2, UCS-4, UTF-7, UTF-8, UTF-16

• BMP
  • The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.
• UCS-2
  • A 16-bit representation can encode up to 65,536 (= 2^16) unique code points.
  • It allocates the entire encoding space for characters (0x0000-0xFFFF).

  UCS-2               Hex     Decimal
  First byte range    00-FF   0-255
  Second byte range   00-FF   0-255

  • UCS-2 and Unicode encodings are identical for most Chinese characters.
• UCS-4
  • It is a four-byte (actually, 31-bit) representation which can encode up to 2,147,483,648 (= 2^31) code points.
  • It allocates the entire encoding space for characters (0x00000000-0x7FFFFFFF).

  UCS-4               Hex     Decimal
  First byte range    00-7F   0-127
  Second byte range   00-FF   0-255
  Third byte range    00-FF   0-255
  Fourth byte range   00-FF   0-255

• UCS-2 vs UCS-4
  • UCS-2 (16-bit): 0000 … FFFF, 65,536 code points. It can only encode the BMP.
  • Unicode (17 planes): 17 × 256 × 256 = 1,114,112 code points.
  • UCS-4 (31-bit): 00000000 … 0000FFFF, 00010000 … 7FFFFFFF, 2,147,483,648 code points. This is sufficient to encode all 17 planes of the Unicode set.
• UTF-16
  • In essence, UTF-16 encodes the BMP exactly as UCS-2 (16-bit) encoding does, so the two are compatible there.
  • It also gives access to the next 16 planes, which are normally only accessible through UCS-4 (32-bit) encoding.
  • The surrogates area is defined with UTF-16 to allow for expansion beyond the 16-bit code space.
• UCS-2 vs UCS-4 vs UTF-16
  • UCS-2: 0000 … FFFF, 65,536 code points. It can only encode the BMP (Plane 0).
  • UTF-16: the surrogates D800-DBFF (high) and DC00-DFFF (low) are combined in pairs to reach Plane 1 … Plane 16, i.e. U+10000 … U+10FFFF. A Unicode scalar value is denoted by U+.
  • UCS-4: 00000000 … 0000FFFF (Plane 0), 00010000 … 0010FFFF (Planes 1-16), … 7FFFFFFF. Its 2,147,483,648 code points are sufficient to encode all 17 planes of the Unicode set.
• Base64
  • 64 characters are used: the upper-case and lower-case Roman alphabet characters (A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols.
  • Step 1: Base64 takes every three bytes (each consisting of eight bits) and converts them into four six-bit segments.
  • Step 2: Each six-bit segment is then converted into a character in the Base64 character set.
  • Step 3: If the size of the original data in bytes is not a multiple of three, we append enough bytes with a value of "0" to create a 3-byte group.
    The Base64 padding character is "=".

  Example (four input bytes, so the second group is padded with two zero bytes):

  Input bytes         00100101 10110100 01101001    01000001 00000000 00000000
  Six-bit segments    001001 011011 010001 101001   010000 010000 000000 000000
  Base64 characters   J b R p                       Q Q = =
  Result              JbRpQQ==

In-Class Exercise
• What is the result of applying Base64 to three characters whose hexadecimal codes are BEAE, CED3 and B7F5 (it is a Japanese name, 小林剑)? (Please first convert the hexadecimal values to binary.)

• UTF-7
  • UTF-7 uses the same character set as Base64.
  • UTF-7 is different from Base64 in that:
    • The "padding" character is not necessary.
    • The Base64-like transformation is applied only to specific characters.
    • A run of characters that requires the Base64 transformation under UTF-7 begins with a "plus" character (+, 0x2B) and ends with a "hyphen" character (-, 0x2D).

  Character String   M        y        (space)   河 豚
  UCS-2 Encoding     004D     0079     0020      6CB3 8C5A
  UTF-7 Encoding     M (4D)   y (79)   (20)      +bLOMWg-
                     ASCII codes                 Base64 of 6CB3 8C5A

• UTF-8
  • UTF-8 encoding was developed as a way to represent Unicode text as a stream of one or more eight-bit bytes, rather than as 16-bit units.
    • It converts UCS-2 into a mixed one- through three-byte encoding.
    • It converts UCS-4 into a mixed one- through six-byte encoding.
    • It converts UTF-16 into a mixed one- through four-byte encoding.
    • It is therefore an eight-bit, variable-length encoding.
    • UTF-8 is the de facto standard encoding for interchange of Unicode text.
• UTF-8 Encoding Templates
  • For all but the ASCII-compatible range, the number of first-byte high-order bits set to "1" indicates the byte length.
  • Fill the templates starting from the rightmost bits.

  UCS-2 Range                UTF-8 Bit Arrays
  0000-007F                  0xxxxxxx
  0080-07FF                  110xxxxx 10xxxxxx
  0800-FFFF                  1110xxxx 10xxxxxx 10xxxxxx

  Unicode Range (+)          UTF-8 Bit Arrays
  0001 0000 to 001F FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

  UCS-4 Range (+)            UTF-8 Bit Arrays
  0020 0000 to 03FF FFFF     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  0400 0000 to 7FFF FFFF     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

• UTF-8 Example: Convert the Unicode character "茶" into UTF-8 code.
  • Step 1: The Unicode value of 茶 (tea) is 8336 (hexadecimal), which lies in the range 0800-FFFF. So we need 3 bytes.
  • Step 2: The binary form of hexadecimal 8336 is 1000 001100 110110.
  • Step 3: Fill the empty slots of the three-byte template with the binary value and get:
      1110xxxx 10xxxxxx 10xxxxxx
      11101000 10001100 10110110
  • Step 4: The UTF-8 code value is thus E8 8C B6.

Wrap-Up
• 中文字符集 (Chinese Character Set)
• 中文编码集 (Chinese Code Set)
• 基本编码方式 (Basic Encoding Method)
  • ASCII
• 中文编码方式 (Chinese Encoding Method)
  • ISO-2022
  • EUC
  • GBK
  • Big5
• 国际编码方式 (International Encoding Method)
  • Unicode
  • ISO 10646-1
  • UCS-2
  • UCS-4
  • UTF-16
  • Base64
  • UTF-7
  • UTF-8
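
Three of the worked examples in this lecture can be re-checked with a short Python sketch: the high-bit-on split of the byte stream AB AC 41 42 43 A4 40, the Base64 example that yields JbRpQQ==, and the UTF-8 encoding of 茶. This is only an illustration under stated assumptions, not part of the lecture material. The function names `split_high_bit` and `utf8_encode` are hypothetical, and the UTF-8 function covers only the one- to four-byte templates (up to U+10FFFF), not the five- and six-byte UCS-4 forms.

```python
import base64


def split_high_bit(data):
    """Split a byte stream into one-byte and two-byte characters.

    Assumption: the high-bit-on scheme, where a byte with its top bit set
    (>= 0x80) starts a two-byte character and any other byte is a
    one-byte (English) character.
    """
    chars, i = [], 0
    while i < len(data):
        width = 2 if data[i] >= 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def utf8_encode(code_point):
    """Encode one code point by filling the UTF-8 bit templates."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point < 0x80:             # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:            # 110xxxxx 10xxxxxx
        lead, n_trail = 0xC0, 1
    elif code_point < 0x10000:        # 1110xxxx 10xxxxxx 10xxxxxx
        lead, n_trail = 0xE0, 2
    elif code_point < 0x110000:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        lead, n_trail = 0xF0, 3
    else:
        raise ValueError("code point beyond U+10FFFF")
    trail = []
    for _ in range(n_trail):
        # Fill from the rightmost bits: six payload bits per trailing byte.
        trail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    return bytes([lead | code_point] + trail[::-1])


if __name__ == "__main__":
    stream = bytes.fromhex("ABAC414243A440")
    print([c.hex().upper() for c in split_high_bit(stream)])
    print(utf8_encode(0x8336).hex().upper())             # E88CB6
    print(base64.b64encode(bytes.fromhex("25B46941")))   # b'JbRpQQ=='
```

The Base64 check uses the standard-library `base64.b64encode` rather than a hand-written encoder. The UTF-8 function is written by hand so that each branch lines up with one row of the template table.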