Character Encodings
& Unicode
Unicode: A Grand Tour
This presentation and its associated materials are licensed under a
Creative Commons Attribution-Noncommercial-No Derivative
Works 2.5 License.
You may use these materials without obtaining permission from the
author. Any materials used or redistributed must contain this notice.
[Derivative works may be permitted with permission of the author.]
This work is copyright © 2008 Addison P. Phillips
 Addison Phillips
 Globalization Architect, Lab126
 This Presentation
 “Internationalization and Unicode Conference”
Tutorial
 Globalization Architect,
Lab126
(Yes, you can touch my Kindle)
 Chair,
W3C Internationalization WG
 Editor,
IETF LTRU-WG (BCP 47)
Unicode
 the design and development of a product that is
enabled for target audiences that vary in culture,
region, or language. [W3C]
 a fundamental architectural approach to software
development
Opinions differ on
capitalization (C12N);
choose from:

i18N

I18n

I18n

I18N
Very geeky; not very
internationalized
(I19G?)
Mystic Numbering (M4C N7G)
I   N  T  E  R  N  A  T  I  O  N  A  L  I  Z  A  T  I  O   N
    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
I18N
Localization = L10N
Globalization = G11N
Canonicalization = C14N
The basics of text processing in software.
“Character encodings consume more than 80% of
my work day. They are the source of more misinformation and confusion than any other single
thing. And developers aren’t getting any better
educated.”
~Glen Perkins
Globalization Architect
Real Jargon
 Multibyte
 Variable width
 Wide character
 Character encoding
 Coded character set
 Bidi or bidirectional
 Glyph, character, code unit
 Unicode
Potentially Bogus Jargon
 kanji
 double-byte language
 extended ASCII
 ANSI, OEM
 encoding agnostic
“bits”: 010000010101101101101000
“byte” or “octet”: 01000001 (0x41)
code unit: a unit of physical storage and information interchange
• represent numbers
• come in various sizes (e.g. 7, 8, 16, 32, 64 bits)
how do we map text to the numbers used by computers?
Glyphs
 A “glyph” is a screen unit of text: it’s a picture
of what users think of as a character.
 A “grapheme” is a single visual unit of text.
À
U+00C0
Characters
 A “character” is a single logical unit of text.
 A “character set” is a set of characters.
 A “code point” is a number assigned to a
character in a character set.
 A “coded character set” is a character set
where each character has a code point.
Bytes
 A “character encoding” maps a sequence
of code points (“characters”) to a sequence
of code units (such as bytes).
 A “code unit” is a single logical unit of
storage.
… 0xC3 0x80 …
 Collection (repertoire) of characters, that is: a set.
 Organized so that each character has a unique numeric
(typically integer) value (code point).
 Examples:
 Unicode
 ASCII (ANSI X3.4)
 ISO 646
 JIS X 0208
 Latin-1 (ISO 8859-1)
Character sets are often
associated with a
particular language or
writing system.
 Maps a sequence of code points (characters) to a
sequence of code units (e.g. bytes).
 Some encodings use another unit instead of the byte.
For example, some encodings use a 16-bit, 32-bit, or 64-bit code unit.
U+00C0
0xC3 0x80
In memory, on disk, on the network, etc.
All text has a character
encoding
When things go wrong, start by asking what the
encoding is, what encoding you expected it to be,
and whether the bytes match the encoding.
Tofu
hollow boxes
Mojibake
Question Marks
garbage characters
(conversion not supported)
 Can appear as either
hollow boxes (empty
glyph) or as question
marks (Firefox, for
example)
 Not usually a bug: it’s a
display problem
 Can mask or
masquerade as
character corruption.
When Good Characters Go Bad
 View text using the
wrong encoding
 Convert to or from the
wrong encoding
 Apply a transfer
encoding and forget to
remove it
 Convert to an
encoding twice
 Overzealous escaping
 Conversion to entities
(“entitization”)
 Multiple conversions
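Two of the failure modes above can be reproduced in a few lines. This is a minimal sketch; the sample character and the variable names are illustrative assumptions, not part of the original slides.

```python
# Illustrative sample: U+00C0, LATIN CAPITAL LETTER A WITH GRAVE.
original = "\u00c0"

utf8_bytes = original.encode("utf-8")            # bytes C3 80

# 1. Viewing UTF-8 bytes through the wrong encoding (Windows-1252 here)
#    turns one character into two unrelated ones.
wrong_view = utf8_bytes.decode("cp1252")

# 2. Converting to UTF-8 twice: the bytes are misread as ISO 8859-1
#    and then encoded again, doubling the damage.
double_encoded = utf8_bytes.decode("iso-8859-1").encode("utf-8")

# The damage is reversible only while no byte has been dropped or replaced.
repaired = wrong_view.encode("cp1252").decode("utf-8")
```

Asking "what are the bytes, and what encoding was assumed?" is usually enough to tell which of these happened.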
 7 bits = 2^7 = 128 characters
 Enough for “U.S. English”
ASCII for
characters 0x00
through 0x7F
Accented letters
and other symbols
0x80 through 0xFF
char   Cp1252   Cp437   Cp850
È      0xC8     ?       0xD4
Windows’s encodings
(called “code pages”)
are generally based on
standard encodings—
plus some additional
characters.
Example:
 CP 1252 is based on ISO
8859-1, but includes 27
“extra” characters in the
C1 control range (0x80-0x9F)
 Originally an IBM
character encoding term.
 IBM numbered their
character sets with
“CCSIDs” (coded character
set ids) and numbered the
corresponding character
encodings as “code pages”.
 Microsoft borrowed code
pages to create PC-DOS.
 Microsoft defines two kinds
of code pages:
 “ANSI” code pages are the
ones used by Windows GUI
programs.
 “OEM” code pages are the
ones used by command
shell/command line
programs.
 Neither “ANSI” nor “OEM”
refer to a particular encoding
standard or standards body in
this context.
 Avoid the use of ANSI and
OEM when referring to
encodings.
 So far we’ve been
looking at single-byte
encodings:
 one byte per character
 1 byte = 1 character (= 1
glyph?)
 256 character maximum
 Good enough for most
alphabetic languages
À
Some languages need more
characters.
What about the “double-byte”
languages?
Don’t those take two bytes per
character?
丏丣並
 Escape sequences to select
another character set
 Example: ISO 2022 uses escape
sequences to select various
encodings
 Use a larger code unit (“wide”
character encoding)
 Example: IBM DBCS code
pages or Unicode UTF-16
 2^16 = 64K characters
 2^32 = 4.3 billion characters
 Use a variable-width encoding
Variable width encodings
use different numbers of
code units to represent
different types of
characters within the same
encoding
One or more bytes per character
 1 byte != 1 character
 May use 1, 2, 3, or 4 bytes per
character
 May use shift or escape
sequences
 May encode more than one
character set
 In fact, single-byte encodings are
a special case of multibyte!
Multibyte Encoding: Any
“variable-width” encoding that
uses the byte as its code unit.
JIS X 0213: A “Multibyte” Character Set
 11,233 characters
 Two 94x94 character planes
 Specific byte ranges
encoding characters that
take more than one byte.
 A “lead byte”
 One or more “trailing bytes”
あ   1-4-2 (code point)    0x82 0xA0
A    1-3-33 (code point)   0x41
 Code point != code unit
(Diagram labels: lead byte, trail byte, single byte)
 In order to reach
more characters,
Shift_JIS
characters start
with a limited
range of “lead
bytes”
 These can be
followed by a
larger range of
byte values
(“trail byte”)
 Lead bytes can be
trail byte values
 Trail bytes include
ASCII values
 Trail bytes include
special values such
as 0x5C (“\”)
char *pos = strchr(mybuf, '@');  /* byte-wise search: in Shift_JIS, 0x40 may be a trail byte */
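The hazard in that byte-wise search can be demonstrated without C. The following is a minimal sketch using Python's built-in `shift_jis` codec; the sample string is a hypothetical chosen because its characters have trail bytes equal to ASCII "\" (0x5C) and "@" (0x40).

```python
# Hypothetical sample: KATAKANA SO, KATAKANA SMALL A, then a real "@".
text = "\u30bd\u30a1@"
raw = text.encode("shift_jis")        # 83 5C | 83 40 | 40

# A byte-wise search (what strchr does) stops at the trail byte of the
# second character, not at the real "@".
byte_hit = raw.find(b"@")

# Searching the decoded text finds the real "@" (third character).
char_hit = text.find("@")

# A backslash byte also hides inside the first character.
has_fake_backslash = b"\\" in "\u30bd".encode("shift_jis")
```

Here `byte_hit` points into the middle of a two-byte character, which is how path separators and escape characters get misdetected in multibyte data.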
 Stateful Encodings
 ex. IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters]
 ISO 2022 [escape sequence changes character set being
encoded]
 A transfer encoding syntax is a reversible transform of encoded
data which may (or may not) include textual data represented in
one or more character encoding schemes.
 Email headers
 URIs
 IDN (domain names)
Abcソース
=?UTF-8?B?QWJj44K944O844K5?=
Abcソース
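The email-header example can be rebuilt by hand to show the layering: characters, then UTF-8 bytes, then base64, then the ASCII-only header syntax. This is a sketch using only the Python standard library; the variable names are illustrative.

```python
import base64
from email.header import decode_header, make_header

subject = "Abc\u30bd\u30fc\u30b9"     # "Abc" followed by three katakana

# Apply the transfer encoding: UTF-8 bytes wrapped in base64 ("B").
payload = base64.b64encode(subject.encode("utf-8")).decode("ascii")
encoded_word = "=?UTF-8?B?" + payload + "?="

# The receiver must remove the transfer encoding before reading the text.
decoded = str(make_header(decode_header(encoded_word)))
```

Forgetting the last step is the "apply a transfer encoding and forget to remove it" failure: the user sees the raw `=?UTF-8?B?...?=` string.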
Common Encoding Conversion Tools and Libraries
• iconv (Unix)
• ICU (C, C++, Java)
• perl Encode
• Java (native2ascii, IO/NIO)
• (etc.)
 Document formats often require a single character encoding be used for all parts of the document.
 When data is merged, the encodings must be merged also (or some of the data will be “mojibake”).
(Diagram: Templates in ISO 8859-1, Content in UTF-8, and Data in Shift_JIS feed a Process that produces Output (HTML, XML, etc.).)
(Diagram: text in ISO 8859-1 (ÀàС£), UTF-8 (детски, »èç plus Arabic and 文字), and Shift_JIS (文字化け) is merged into a single ISO 8859-1 or UTF-8 output. Without correct conversion the result reads: ÀàС£ ?????? »èç????? ????)
? (0x3F) is the replacement
character for ISO 8859-1
Encoding conversion acts as a “filter”
 Replacement characters (“question marks”) replace
characters from the source character set that are not
present in the target character set.
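The filtering effect is easy to observe. In this sketch the sample strings are illustrative, and Python's `errors="replace"` handler stands in for whatever replacement policy a given converter uses.

```python
samples = ["\u00c0\u00e0\u00c7\u00a3", "детски", "文字化け"]

# Characters outside ISO 8859-1 are replaced by "?" (0x3F), one per character.
filtered = [s.encode("iso-8859-1", errors="replace") for s in samples]

# With strict conversion the same data raises instead of silently losing text.
try:
    "文字化け".encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Once the question marks are written out, the original characters cannot be recovered from the output.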
 Need for more converters and
conversion maps
 Difficulty of passing, storing,
and processing data in multiple
encodings
 Too many character sets…
…leads to what we call “code
page hell”
 Basic Principles
 Universal repertoire
 Logical order
 Efficiency
 Unification
 Characters, not glyphs
 Dynamic composition
 Semantics
 Stability
 Plain Text
 Convertibility
 Fights mojibake
because:
 characters are from the
common repertoire;
 characters are encoded
according to one of the
encoding forms;
 characters are
interpreted with
Unicode semantics;
 unknown characters are
not corrupted
Unicode is a character set that supports all of the world’s
languages and writing systems.
 Code space runs from 0 to 0x10FFFF (about 1.1
million code points)
 Unicode and ISO 10646 are maintained in sync.
 Unicode is maintained by an industry consortium.
 ISO 10646 is maintained by the ISO.
 Divide Unicode in equal
sized regions of code points.
 17 planes (0 through 0x10),
each with 65,536 code points.
 Plane 0 is called the Basic
Multilingual Plane (BMP).
 > 99% of text in the wild lives
in the BMP
 Planes 1 through 0x10 are
called supplementary planes.
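Because each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch (the function names are hypothetical):

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 through 16) containing a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16


def is_bmp(code_point: int) -> bool:
    """True for Basic Multilingual Plane code points (plane 0)."""
    return plane_of(code_point) == 0
```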
 An organized
collection of
characters.
 Each character has a
code point
aka Unicode Scalar Value
(USV)
 U+0041 <= hex
notation
 code point
 name
 character class
 combining level
 bidi class
 case mappings
 canonical decomposition
 mirroring
 default grapheme clustering
ӑ (U+04D1)
CYRILLIC SMALL LETTER A WITH BREVE
 letter
 non-combining
 left-to-right
 decomposes to U+0430 U+0306
 Ӑ U+04D0 is uppercase (and titlecase)
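Most of these properties can be read from Python's standard `unicodedata` module, which exposes the Unicode Character Database bundled with the interpreter. A small sketch for the same character:

```python
import unicodedata

ch = "\u04d1"  # CYRILLIC SMALL LETTER A WITH BREVE

properties = {
    "name": unicodedata.name(ch),
    "category": unicodedata.category(ch),        # "Ll" = lowercase letter
    "combining": unicodedata.combining(ch),      # 0 = non-combining
    "bidi": unicodedata.bidirectional(ch),       # "L" = left-to-right
    "mirrored": unicodedata.mirrored(ch),
    "decomposition": unicodedata.decomposition(ch),
    "uppercase": ch.upper(),
}
```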
Many characters were included in
Unicode for round-trip conversion
compatibility with legacy encodings:
①②③45Ⅵ
¾Lj¼Nj½dž
︴︷︻︽﹁﹄
ヲィゥォェュ゙
‫ﺲﺳﻫﺽﵬﷺ‬
fiflffifflſtﬔ
Compatibility Characters
includes presentation forms
legacy encoding: a term for non-Unicode character encodings.
U+FEFF
 Used to indicate the “byte-order” of UTF-16 code units
 0xFE FF; 0xFF FE
 Also used as a Unicode signature by some software (Windows’s
Notepad editor, for example) for UTF-8
 0xEF BB BF
Appears as a character or renders as
junk in some formats or on some
systems. For example, older browsers
render it as three bytes of mojibake.
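A short sketch of how the signature behaves in practice, using Python's standard codecs. The `utf-8-sig` codec strips a leading UTF-8 signature, while plain `utf-8` keeps it as a U+FEFF character; the sample data is illustrative.

```python
import codecs

data = codecs.BOM_UTF8 + "abc".encode("utf-8")   # EF BB BF 61 62 63

kept = data.decode("utf-8")          # U+FEFF survives as the first character
stripped = data.decode("utf-8-sig")  # signature removed

# For UTF-16 the BOM tells the decoder which byte order follows.
text_be = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16")
text_le = (codecs.BOM_UTF16_LE + "A".encode("utf-16-le")).decode("utf-16")
```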
U+FFFD
 Indicates a bad byte
sequence or a
character that could
not be converted.
 Equivalent to
“question marks” in
legacy encoding
conversions
�
there was a character here,
but it is gone now
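A minimal sketch of where U+FFFD comes from: a lenient decoder substitutes it for bytes that are not valid in the stated encoding. The byte string here is a made-up example.

```python
bad = b"abc\xffdef"                  # 0xFF can never appear in UTF-8

# Lenient decoding marks the damage with U+FFFD and carries on.
lenient = bad.decode("utf-8", errors="replace")

# Strict decoding reports the problem instead.
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```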
 Composition can create “new” characters
 Base + non-spacing (“combining”) characters
A+˚ = Å
U+0041 + U+030A = U+00C5
a+ˆ+.=ậ
U+0061 + U+0302 + U+0323 = U+1EAD
a+.+ˆ=ậ
U+0061 + U+0323 + U+0302 = U+1EAD
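Both orderings compose to the same character because the two marks have different combining classes, so normalization puts them into one canonical order first. A sketch with the standard `unicodedata` module:

```python
import unicodedata

circumflex_first = "a\u0302\u0323"   # a + circumflex + dot below
dot_first = "a\u0323\u0302"          # a + dot below + circumflex

nfc_a = unicodedata.normalize("NFC", circumflex_first)
nfc_b = unicodedata.normalize("NFC", dot_first)

# Decomposed form: the dot below (class 220) sorts before the
# circumflex (class 230) regardless of the input order.
nfd_a = unicodedata.normalize("NFD", circumflex_first)

a_ring = unicodedata.normalize("NFC", "A\u030a")
```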
ญัตติที่เสนอได้ผา่ นที่ประชุมด้วยมติเอกฉันท
ญั = ญ + ัั
glyph = consonant + vowel
ญัตติที่เสนอได้ผา่ นที่ประชุมด้วยมติเอกฉันท (word boundaries)
What is Unicode?
यू नि को ड क्या है ?
यू नि को ड
य + ◌ू, न + ◌ि, क + ◌ो, ड
न + ◌ि = नि
கொ ‘ko’
U+0B95 Ka
U+0BC6 E
U+0BBE Aa
Combining mark drawn to the left of the base character
 UTF-32
 Uses 32-bit code units.
 All characters are the same width.
 UTF-16
 Uses 16-bit code units.
 BMP characters use one 16-bit code unit.
 Supplementary characters use two special 16-bit code units: a
“surrogate pair”.
 UTF-8
 Uses 8-bit code units (bytes!)
 It’s a multi-byte encoding!
 Characters use between 1 and 4 bytes.
 ASCII is ASCII in UTF-8
          A (U+0041)    À (U+00C0)    ቑ (U+1251)        𐌸 (U+10338)
UTF-32:   0x00000041    0x000000C0    0x00001251        0x00010338
UTF-16:   0x0041        0x00C0        0x1251            0xD800 0xDF38
UTF-8:    0x41          0xC3 0x80     0xE1 0x89 0x91    0xF0 0x90 0x8C 0xB8
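These values can be regenerated with Python's built-in codecs. A small sketch; the helper name `hex_bytes` is hypothetical, and the big-endian codec variants are used so that no byte-order mark is added.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and show the resulting bytes in hex."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))


characters = ["A", "\u00c0", "\u1251", "\U00010338"]
encodings = ["utf-32-be", "utf-16-be", "utf-8"]

table = {
    f"U+{ord(ch):04X}": {enc: hex_bytes(ch, enc) for enc in encodings}
    for ch in characters
}
```

Note that the output is shown byte by byte, so a UTF-16 code unit such as 0xD800 appears as `D8 00`.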
 Uses 32-bit code units (instead of the more-familiar 8-bit
code unit, aka the “byte”)
 Each character takes exactly one code unit.
U+1251 ቑ 0x00001251
U+10338 𐌸 0x00010338
 Easy to process
 each logical character
takes one code unit
 can use pointer arithmetic
 Not commonly used
 Not efficient for storage
 11 bits are never used
 BMP characters are the
most common—16 bits
wasted for each of these
 Affected by processor
architecture (Big-Endian
vs. Little-Endian)
 Uses 16-bit code units (instead of the more-familiar 8-bit
code unit, aka the “byte”)
 BMP characters use one unit
 Supplementary characters use a “surrogate pair”, special code
points that don’t do anything else.
U+1251 ቑ 0x1251
U+10338 𐌸 0xD800 0xDF38
High Surrogate: 0xD800-DBFF
Low Surrogate: 0xDC00-DFFF
Unique Ranges!
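The surrogate pair for a supplementary character is plain arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch with hypothetical function names:

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Because the high and low ranges do not overlap each other or any other character, a decoder can tell from any single code unit whether it is looking at the start, the end, or neither part of a pair.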
 Most common languages
and scripts are encoded in
the BMP.
 Less wasteful than UTF-32
 Simpler to process
(excepting surrogates)
 Commonly supported in
major operating
environments,
programming languages,
and libraries
 May not be suitable for all
applications
 Affected by processor
architecture (Big-Endian
vs. Little-Endian)
 Requires more storage, on
average, for Western
European scripts, ASCII,
HTML/XML markup.
 7-bit ASCII is itself
 All other characters take 2, 3, or 4 bytes each
 lead bytes have a special pattern
 trailing bytes range from 0x80->0xBF
Lead Byte   Trail Bytes                   Code Points
0xxxxxxx                                  < 0x80
110xxxxx    10xxxxxx                      < 0x800
1110xxxx    10xxxxxx 10xxxxxx             < 0x10000
11110xxx    10xxxxxx 10xxxxxx 10xxxxxx    Supplementary
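The bit patterns translate directly into shifts and masks. The following is a teaching sketch of an encoder, not a replacement for a library codec; the function name is hypothetical.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point using the UTF-8 bit patterns."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")
    if cp < 0x80:                                   # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110xxxx + 2 trail bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                              # 11110xxx + 3 trail bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("outside the Unicode code space")
```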
 ASCII-compatible
 Default or recommended encoding for many Internet standards
 Bit pattern highly detectable (over longer runs)
 Non-endian
 Streaming
 C char* friendly
 Easy to navigate
 Multibyte encoding
requires additional
processing awareness
 Non-shortest form
checking needed
 Less efficient than UTF-16
for large runs of Asian text
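"Non-shortest form" means a character encoded with more bytes than it needs, for example "/" (U+002F) written as the two bytes 0xC0 0xAF. A conforming decoder must reject such sequences, since they can smuggle characters past byte-level filters. A small sketch showing that Python's strict decoder does so (the helper name is hypothetical):

```python
def decodes_as_utf8(data: bytes) -> bool:
    """True if the bytes are well-formed UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = b"/"                  # U+002F in its only valid form
overlong_2 = b"\xc0\xaf"         # same code point padded to two bytes
overlong_3 = b"\xe0\x80\xaf"     # same code point padded to three bytes
```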
 Set Web server to declare UTF-8 in HTTP Content-Type header
 Declare UTF-8 in META tag header
 Actually use UTF-8 as the encoding!!
<?php
header("Content-type: text/html; charset=UTF-8");
?>
<html>
<head>
<meta
http-equiv="Content-Type"
content="text/html; charset=UTF-8" />
<title>Fight 文字化け!</title>
</head>
It’s more than just a character set and some encodings…
Unicode provides additional information:
 Character name
 Character class
 “ctype” information, such as if it’s a digit, number, alphabetic, etc.
 Directionality (LTR, RTL, etc.) and the Bidi Algorithm
 Case mappings (UPPER, lower, and Titlecase)
 Default Collation and the Unicode Collation Algorithm (UCA)
 Identifier names
 Regular Expression syntaxes
 Normalization
 Compatibility information
Many of these items are in the form of Unicode Technical Reports
 http://www.unicode.org/reports
Unicode Normalization has to deal
with more issues than case (Abc, ABC, abc, abC, aBc):
• single or multiple combining marks
• compatibility characters
• presentation forms
Ǻ
Canonically equivalent ways to represent:
U+01FA
U+00C5 U+0301
U+212B U+0301
U+0041 U+030A U+0301
Look-alike sequences that are not canonically equivalent (the acute precedes the ring, and both marks have the same combining class, so their order is significant):
U+00C1 U+030A
U+0041 U+0301 U+030A
 Form D
canonical decomposition
 Form C
canonical decomposition
followed by composition
 Form KD
kompatibility
decomposition
 Form KC
kompatibility
decomposition followed by
composition
Ǻ
Original               Form C          Form D                 Form KC         Form KD
U+01FA                 U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+00C5 U+0301          U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+212B U+0301          U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+00C1 U+030A          U+00C1 U+030A   U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A
U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A
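The normalization forms of each input sequence can be checked with Python's `unicodedata.normalize`. This is a sketch; the helper name and the table layout are illustrative assumptions, and the results depend on the Unicode data shipped with the interpreter (these particular characters have been stable for a long time).

```python
import unicodedata


def code_points(s: str) -> str:
    """Render a string as U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


originals = [
    "\u01fa",
    "\u00c5\u0301",
    "\u212b\u0301",
    "A\u030a\u0301",
    "\u00c1\u030a",
    "A\u0301\u030a",
]
forms = ["NFC", "NFD", "NFKC", "NFKD"]

table = {
    code_points(s): {f: code_points(unicodedata.normalize(f, s)) for f in forms}
    for s in originals
}
```

Sequences that put the acute before the ring never normalize to U+01FA: both marks have the same combining class, so normalization does not reorder them.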
 Not all compatibility characters have a compatibility
decomposition.
 Not all characters that look alike or have similar
semantics have a compatibility decomposition.
 For example, there are many ‘dots’ used as a period.
 Not all character variations are handled by
normalization.
 For example, upper, title, and lowercase variations.
 Normalization can remove meaning
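A few illustrative cases of what compatibility normalization does and does not do, again using the standard `unicodedata` module; the sample characters are chosen for this sketch.

```python
import unicodedata


def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)


circled_one = nfkc("\u2460")     # CIRCLED DIGIT ONE loses its circle
ligature = nfkc("\ufb01")        # LATIN SMALL LIGATURE FI becomes two letters
squared = nfkc("x\u00b2")        # the superscript becomes a plain digit

# Case is not touched by any normalization form; fold it separately.
still_upper = nfkc("ABC")
folded = nfkc("ABC").casefold()
```

The third case is the loss of meaning: after NFKC, "x squared" and "x2" are indistinguishable.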
 Some languages are written
predominantly from left-to-right (LTR).
 Some languages are written
predominantly from right-to-left (RTL).
 (A few can be written top-to-bottom or using other
schemes)
Unicode defines character
“directionality” and a “Bidi”
algorithm for rendering text.
 Uses logical, not visual,
order.
 Uses levels of
“embedding”.
 Requires markup changes
in some HTML for full
support.
Characters are encoded in logical order.
Visual order is determined by the layout.
Override and bidi control characters
 “Indeterminate” characters
 Defines default collation algorithm and sequences
(UTS#10)
 Must be tailored by language and “locale” (culture) and
other variations.
Language:         Swedish: z < ö                 German: ö < z
Usage:            German Dictionary: of < öf     German Telephone: öf < of
Customizations:   Upper-first: A < a             Lower-first: a < A
Find grapheme, word, and line-break boundaries in
text.
• Tailored by language
• Provides good basic default handling
 Remember “all text has an encoding”?
 user input via forms
 email
 data feeds
 existing, legacy data
 database instances
 uploads
 Use UTF-8 for HTML and Web forms
 Use UTF-8 in your APIs
 Check that data really is UTF-8
 Control encoding via code; avoid hard-coding the encoding
 Watch out for legacy encodings
 Convert to Unicode as soon
as practical.
 Convert from Unicode as
late as possible.
 Wrap Unicode-unfriendly
technologies
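One way to picture "convert early, convert late" is a pair of boundary functions around code that only ever sees Unicode strings. This is a minimal sketch; the function names, the UTF-8 default, and the Shift_JIS sample are assumptions for illustration.

```python
def ingest(raw: bytes, declared_encoding: str) -> str:
    """Convert to Unicode at the input boundary.

    Strict decoding fails loudly if the bytes do not match the
    declared encoding, instead of passing mojibake inward.
    """
    return raw.decode(declared_encoding)


def emit(text: str, target_encoding: str = "utf-8") -> bytes:
    """Convert from Unicode only at the output boundary."""
    return text.encode(target_encoding)


# A legacy back end hands over Shift_JIS bytes and says so.
legacy_bytes = "文字化け".encode("shift_jis")

text = ingest(legacy_bytes, "shift_jis")   # Unicode from here on
output = emit(text)                        # UTF-8 on the way out
```

Passing the encoding as a parameter, rather than hard-coding it inside the processing code, keeps the conversion points visible and few.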
Map Your System
 Front Ends
 use Unicode encoding
 APIs
 use Unicode encoding
 hide internal storage encoding
 Data Stores, Local I/O
 use Unicode encoding
 consider an encoding conversion plan
 Back Ends, External Data
 Uses Unicode?
 If not, what encoding?
 Store the encoding!
(Diagram: Your System as a “Unicode Cloud”. Input in a legacy encoding passes through Capture Encoding and Detect / Convert on the way in; a Unicode Interface and a Convert to Legacy step sit at the system boundary.)
Be aware of whether you need to
count glyphs, characters, or bytes:
 Is the limit “screen positions”,
“characters”, or “bytes of storage”?
 Should you be using a different limit?
Which one are you actually counting?
varchar(110)
यूनिकोड
य + ◌ू + न + ◌ि + क + ◌ो + ड
(4 glyphs)
(7 characters)
E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1
(21 bytes)
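The three counts can be computed for the same word. Bytes and code points are exact; the glyph count below is only a rough approximation that skips combining marks, since true grapheme segmentation needs the Unicode text segmentation rules, which the Python standard library does not provide.

```python
import unicodedata

# The Hindi word for "Unicode", written as explicit code points.
word = "\u092f\u0942\u0928\u093f\u0915\u094b\u0921"

byte_count = len(word.encode("utf-8"))          # storage in UTF-8
utf16_units = len(word.encode("utf-16-be")) // 2
code_point_count = len(word)                    # Python strings count code points

# Rough approximation of "screen positions": count only characters that
# are not marks (general categories Mn, Mc, Me).
approx_glyphs = sum(
    1 for c in word if not unicodedata.category(c).startswith("M")
)
```

A `varchar(110)` column therefore holds very different amounts of text depending on whether the database counts bytes or characters.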
 Code unit
 Code point
 Character
 Glyph
 Multibyte encoding
 Tofu
 Mojibake
 Question Marks
 “All text has an
encoding”
 17 planes of goodness
 1.1 million potential code
points
 150,000 assigned code
points
 3 encodings
 UTF-32
 UTF-16
 UTF-8
 Normalize
 Bidi
 Collation
 Case folding
 … and so much more
Q&A
“Would you
please write the
code for I18N on
the whiteboard
before you go?”
#import i18n.h
#define UNICODE