Example 4.1. Let us consider a concrete example of how the UTF-8 code of a
code point is determined. The ASCII characters are not so interesting since for
these characters the UTF-8 code agrees with the code point. The Norwegian
character ’Å’ is more challenging. If we check the Unicode charts,4 we find that
this character has the code point $c5_{16} = 197$. This is in the range 128–2047 which
is covered by rule 2 in fact 4.6. To determine the UTF-8 encoding we must find
the binary representation of the code point. This is easy to deduce from the
hexadecimal representation. The least significant numeral (5 in our case) determines the four least significant bits and the most significant numeral (c) determines the four most significant bits. Since $5 = 0101_2$ and $c_{16} = 1100_2$, the code
point in binary is
\[ 000\ \overbrace{1100}^{c}\ \overbrace{0101}^{5}{}_{2}, \]
where we have added three 0s to the left to get the eleven bits referred to by
rule 2. We then distribute the eleven bits as in (4.4) and obtain the two bytes
11000011,
10000101.
In hexadecimal this corresponds to the two values c3 and 85, so the UTF-8 encoding of ’Å’ is the two-byte number $c385_{16}$.
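
The computation in this example can be checked with a few lines of Python. The following is only a sketch: the function name utf8_two_byte is hypothetical, and it assumes that the bit layout referred to in (4.4) is the standard two-byte UTF-8 pattern 110xxxxx 10xxxxxx, which agrees with the two bytes obtained above.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 128-2047 as two UTF-8 bytes.

    Hypothetical helper for illustration; it assumes the two-byte
    layout 110xxxxx 10xxxxxx for the eleven bits of the code point.
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the two-byte form only covers code points 128-2047")
    # First byte: marker bits 110, then the five most significant of the eleven bits.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: marker bits 10, then the six least significant bits.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


code_point = 0xC5                      # the code point of 'Å', 197 in decimal
bits = format(code_point, "011b")      # the code point padded to eleven bits
encoded = utf8_two_byte(code_point)

print(bits)                            # 00011000101
print(encoded.hex())                   # c385
# Compare with Python's built-in UTF-8 encoder.
print(encoded == "\u00c5".encode("utf-8"))  # True
```

The shift by six positions separates the five leading bits from the six trailing bits, which is the same split as was done by hand when the eleven bits were distributed over the two bytes.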
4 The Latin 1 supplement can be found at www.unicode.org/charts/PDF/U0080.pdf/.