Chapter 2

2 Background and Related Work
2.1 CJK Character Coding Schemes
Because most work on early computers took place in England and the
United States, computers tend to have character coding schemes that work well
with English texts. Also, English character coding schemes are reasonably
standard. The coding schemes for non-Latin characters, however, are both less
standardized and less fully developed. Japanese, Chinese, and Korean all use
Chinese characters, but each language has a range of coding schemes for such
characters. Below is a table showing some common character encoding schemes
employed for Japanese, Chinese and Korean.
Country   Encoding          Modal   Fixed Length
Japan     EUC-JP            no      no
          Shift-JIS         no      no
          ISO-2022-JP       yes     no
China     GBK               no      no
          Big Five          no      no
          Big Five Plus     no      no
          EUC-CN            no      no
          ISO-2022-CN       yes     no
          ISO-2022-CN-EXT   yes     no
Korea     Johab             no      no
          EUC-KR            no      no
          ISO-2022-KR       yes     no

Locale-dependent character encodings by country
The encodings listed above are simply different schemes for representing the
characters included in each character set. Among the more interesting
encodings are the ISO-2022 schemes.
<Insert table from page 148 of CJKV about ISO-2022; use the table on page 140 of CJKV.>
<Describe the major encoding schemes, and how modal and non-modal schemes
differ. Also, present diagrams of how each encoding system works, showing the
coding for a few characters in the various coding schemes. This should also
include a chart, like that found on p. 149 of CJKV, showing the encoding space.>
Type      Encoding   Modal   Fixed Length   Backwards Compatible   Find Start of Chars Easily
Unicode   UCS-2      no      yes            no                     yes
Unicode   UCS-4      no      yes            no                     yes
UTF       UTF-7      yes     no             yes                    no
UTF       UTF-8      no      no             yes                    yes

International Encoding Methods
<Show examples of characters encoded using each of the above coding schemes.>
Why an international encoding (especially UTF-8/Unicode) is the superior form
of coding:
A problem with UCS-2 and UCS-4 is that they use fixed-length
encoding. This style of encoding wastes space in the unused areas of the
encoding space. A disadvantage of UTF-7 is that, like ISO-2022, it is a
modal encoding. Modal encodings use somewhat more space than a
similar non-modal encoding, especially in the case of text that switches
between differently escaped characters. UTF-8, on the other hand,
is a non-modal encoding that uses a variable number of bytes per character.
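As a quick illustration, the following Python sketch encodes the same two characters under several Unicode encoding forms (the big-endian forms are chosen here to avoid counting a byte-order mark; UTF-32 corresponds to UCS-4):

```python
# Compare how many bytes the two characters 漢字 occupy under several
# Unicode encoding forms. UTF-32 (UCS-4) spends four bytes per character
# regardless of need; UTF-8 spends only what each character requires;
# UTF-7 pays extra for its modal shift sequences.
text = "漢字"
for enc in ("utf-32-be", "utf-16-be", "utf-8", "utf-7"):
    data = text.encode(enc)
    print(f"{enc}: {len(data)} bytes ({data!r})")
```

The fixed-length forms cost the same for every character, while the variable-length and modal forms trade space differently depending on the text.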
2.2 Existing CJK Input Methods
Over the years, a number of systems have been developed for inputting
Asian languages. These input systems range from complex arrays of characters
on a pad with thousands of keys, to chording-style combinational input, to
simple use of the Roman transliteration. The most basic division between
different styles of input techniques is between direct and indirect methods of
character input.
Direct methods of input include a keyboard with one key for each
character, which for Japanese would require thousands of keys. The standard English
QWERTY keyboard is another example of direct input. Each key press (when
inputting English text) corresponds to exactly one character. It is possible that
some modal information may have also been inserted first, such as a shift or
control key, but when the actual character is input, there is no ambiguity about
which character is being inserted. For languages using Chinese characters,
creating a keyboard with this sort of hardware-supported direct input is quite
impractical. Therefore, the more common type of direct input is implemented
in software. This is done by having the user type in the index number of the
character that he wishes to display. This results in exactly one match, as the
index is the unique identifier for each character. For example, to
input 漢字 using direct-input Unicode, the user would type:
6F22 5B57
where 6F22 is the Unicode value for the character 漢, and 5B57 is the Unicode value
for the character 字. Using direct input requires that the user either remember, or
look up in a table every character that he wishes to input. The task is complicated
by the fact that these index values are assigned arbitrarily. For this reason, the
direct-input method is usually used only as a last resort.
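A minimal Python sketch of this index-based direct input (the function name is invented for illustration):

```python
# Direct input: the typed hex index IS the character's unique
# identifier, so lookup is unambiguous -- exactly one match per index.
def direct_input(hex_index: str) -> str:
    return chr(int(hex_index, 16))

print(direct_input("6F22"))  # 漢
print(direct_input("5B57"))  # 字
```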
Indirect input techniques will, for many inputs, have some ambiguity
regarding which character the user actually desires. Most
implementations of these techniques rely on hashing the user input. For Chinese
input, a variety of options exist. One method is to have the user type in the
Romanized reading of the desired character using the Pinyin Romanization scheme.
However, due to the number of homonyms present in the Chinese language
(Huang, p. 61), there will be a number of possible collisions for a given input. In order
to resolve the collisions, the user must pick from the list of characters with the
reading that was input. Examples of this process are shown below.
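The collision-then-select flow can be sketched as follows (the toy reading table is invented for illustration; a real system would use a full dictionary and ignore or use tones as a design choice):

```python
# Indirect input by reading: one Pinyin string maps to many homophonous
# candidates, and the user must disambiguate by picking from the list.
CANDIDATES = {
    "han": ["汉", "含", "寒", "韩"],
    "zi": ["字", "子", "自", "紫"],
}

def lookup(pinyin: str) -> list[str]:
    # Return every character sharing this romanized reading.
    return CANDIDATES.get(pinyin, [])

print(lookup("han"))
```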
The other technique used for indirect input is to have the user input a
sequence of keystrokes that are combined to form the desired Chinese character.
For Chinese, this might involve typing a sequence of 3-4 keys. Each of these
keys will represent a given shape. The order in which these are entered will
determine the hashed value, and therefore the set of matches that might be
returned by this input technique. For example:
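A sketch of this ordered-key lookup (the key names and table here are invented placeholders, not any real shape-input scheme; 明 decomposes into 日 "sun" and 月 "moon", 好 into 女 "woman" and 子 "child"):

```python
# Component input: an ordered tuple of shape keys maps to a candidate
# set; a different order maps to a different (or empty) set.
SHAPE_TABLE = {
    ("sun", "moon"): ["明"],      # invented shape keys for illustration
    ("woman", "child"): ["好"],
}

def convert(keys: tuple[str, ...]) -> list[str]:
    return SHAPE_TABLE.get(tuple(keys), [])

print(convert(("sun", "moon")))   # ['明']
print(convert(("moon", "sun")))   # [] -- order matters
```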
Input in the Japanese language is, by the most common techniques, a
combination of a direct technique with that of an indirect technique. Because the
hiragana syllable alphabet is limited in size, and the romanization schemes used
for input contain no collisions, the input of hiragana is really quite simple. The
user inputs the romanization of the character desired. For example, ‘ha’ would
yield the hiragana は. At this point, the transition from a direct to an indirect
encoding takes place. This input is hashed, which leads to the frequent
occurrence of collisions, and therefore a set of results. For purposes of input, it
makes sense to add katakana syllables to the list of kanji characters. The reason
for this should become apparent shortly. After typing in the character ‘は’, a list
of possible characters will appear on the screen:
葉,歯,派,刃,覇….. and the katakana, ハ.
Another example:
Input: ‘harajuku’ -> ‘はらじゅく’
The string ‘はらじゅく’ is now hashed, and the results are:
原宿, ハラジュク, or of course the input, はらじゅく
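The two-stage pipeline in the ‘harajuku’ example can be sketched like this (toy tables, invented for illustration):

```python
# Stage 1 (direct): romaji -> hiragana, collision-free per syllable.
ROMAJI_TO_KANA = {"ha": "は", "ra": "ら", "ju": "じゅ", "ku": "く"}
# Stage 2 (indirect): the kana string is hashed to a candidate list.
CANDIDATES = {"はらじゅく": ["原宿", "ハラジュク", "はらじゅく"]}

def to_kana(syllables):
    return "".join(ROMAJI_TO_KANA[s] for s in syllables)

def candidates(kana):
    # The raw kana input is always itself a valid candidate.
    return CANDIDATES.get(kana, [kana])

kana = to_kana(["ha", "ra", "ju", "ku"])
print(kana)              # はらじゅく
print(candidates(kana))  # ['原宿', 'ハラジュク', 'はらじゅく']
```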
The user now selects from this list the character that matches what he desired to
input. Many systems do not keep a static ordering of the characters in the hash-
collision list, but rather reorder them based on the frequency of selection. Each
time a character is selected, it will move up the list. Whether a character moves
immediately to the top of the list upon being selected, or receives a smaller
promotion is an implementation-specific detail with which we are not concerned.
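One simple implementation of such reordering (a move-to-front sketch; as noted, real systems may apply a gentler promotion):

```python
# Reorder the collision list on selection: the chosen character moves
# to the front so that frequent picks surface first next time.
def select(candidates: list[str], chosen: str) -> str:
    candidates.remove(chosen)
    candidates.insert(0, chosen)
    return chosen

lst = ["葉", "歯", "派", "刃", "覇"]
select(lst, "派")
print(lst)  # ['派', '葉', '歯', '刃', '覇']
```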
The most commonly used input technique for Japanese is referred to as
transliteration, or inputting a character’s sound in a different alphabet. For
example, to input 日本語, the user types ‘nihongo’. Most Japanese keyboards
also include ふりがな characters on the keyboard. This allows the user to type
the actual Japanese character desired. However, in order to input a 漢字,
transliteration comes into play once again. The user would first type in the
desired ふりがな characters. After completing the word, the user presses the
space bar, and a list is displayed containing possible 漢字 compounds that can be
read according to the ふりがな input by the user.
Another approach to inputting Chinese characters is to enter a sequence of
keystrokes, each of which represents a certain sub-character. The sub-characters
are then used to convert the input to a larger, complete Chinese character. The
order in which the components are typed is important, and there is a limit on the
number of keys that can be used to represent a character. For example,
This type of input works well for typing documents, but is difficult for a
novice in the language, who may remember only some part of the 漢字, and
neither its pronunciation nor its meaning.
A more intuitive form of input for novice users of Japanese trying to input
vaguely known characters would be to do so based on shape. Of course, a
keyboard with all of the 漢字 on it would be far too large to be practical. This is
the motivation for a system that allows the user to input what parts of the
character he does know, and allow an algorithm to suggest what he was looking
for.
<A thorough background of these exists in the Huang book>
2.3 Regular Expressions and Unicode
Regular expressions were originally developed for use with ASCII-coded text.
Therefore, using regular expressions with Unicode tends to lead to some
unwanted results. Confusion exists regarding how certain metacharacters
should behave (“\b”, “.”, “\w”, etc.).
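One concrete illustration of this mode-dependence, using Python's re module (whose str patterns are Unicode-aware but whose bytes patterns use ASCII semantics):

```python
import re

# Under a Unicode (str) pattern, \w matches an ideograph; under a
# bytes pattern it falls back to ASCII [a-zA-Z0-9_] and does not.
print(bool(re.match(r"\w", "漢")))                   # True
print(bool(re.match(rb"\w", "漢".encode("utf-8"))))  # False
```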
A shortcoming of using standard regular expressions for 漢字 searches is
that there is no real way to specify less than an entire character. In English, an RE like
“d.g” would return dig, dug and dog. Dig and dug are related to each other, but
dog certainly is not. However, if a user only remembered part of the word,
namely the ‘d’ and the ‘g’, without remembering what was in the middle, he
would still be able to search for, and find, his word.
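The English “d.g” example can be sketched quickly in Python:

```python
import re

# "d.g" matches any three-letter word starting with 'd' and ending
# with 'g', regardless of the middle letter.
words = ["dig", "dug", "dog", "dg", "ding"]
matches = [w for w in words if re.fullmatch(r"d.g", w)]
print(matches)  # ['dig', 'dug', 'dog']
```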
If a user is searching for 家族, but cannot remember either of these (rather
complicated) characters, then regular expressions will not be of any use to him.
Perhaps he can remember the top of the 家 character, and the left side of the 族
character (方). Thus, despite remembering parts of each “word” used to make
up 家族, he is unable to write a regular expression to locate it. My system allows
this search to be made using \k(TB<8:*>)\k(LR<方:*>) as the regular
expression. (家 is not in my dictionary yet!!!)