2 Background and Related Work

2.1 CJK Character Coding Schemes

Because most work on early computers took place in England and the United States, computers tend to have character coding schemes that work well with English text, and English character coding schemes are reasonably standard. The coding schemes for non-Latin characters, however, are both less standardized and less fully developed. Japanese, Chinese, and Korean all use Chinese characters, but each language has a range of coding schemes for such characters. The table below lists some common character encoding schemes employed for Japanese, Chinese, and Korean.

  Country  Encoding         Modal  Fixed Length         Locale Dependent/Independent
  Japan    EUC-JP           No     No                   Dependent
           Shift-JIS        No     No                   Dependent
           ISO-2022-JP      Yes    No                   Dependent
  China    GBK              No     No                   Dependent
           Big Five         No     No                   Dependent
           Big Five Plus    No     No                   Dependent
           EUC              No     No                   Dependent
           ISO-2022-CN      Yes    No                   Dependent
           ISO-2022-CN-EXT  Yes    No                   Dependent
  Korea    Johab            No     Yes (2-byte hangul)  Dependent
           EUC              No     No                   Dependent
           ISO-2022-KR      Yes    No                   Dependent

  Character codings by country

The encodings listed above are simply different plans for encoding the characters included in each scheme. Among the more interesting encodings are the ISO-2022 schemes. <Insert table from page 148 about ISO 2022. Use table on page 140 of CJKV. A description of the major encoding schemes. Describe how modal and non-modal schemes differ. Also, present diagrams of how each encoding system works (show the coding for a few characters in the various coding schemes). This should also include a chart, like that found on p. 149 of CJKV, showing the encoding space in chart form.>

  Type  Unicode Encoding  Modal  Fixed Length  Backwards Compatible  Find start of chars easily
  UCS   UCS-2             No     Yes           No                    Yes
        UCS-4             No     Yes           No                    Yes
  UTF   UTF-7             Yes    No            Yes (7-bit safe)      No
        UTF-8             No     No            Yes (with ASCII)      Yes

  International encoding methods

<Show examples of characters encoded using each of the above coding schemes. Explain why an international encoding (especially UTF-8/Unicode) is the superior form of coding.>

A problem with UCS-2 and UCS-4 is that they use fixed-length encoding. This style of encoding wastes space in the unused areas of the encoding space. A disadvantage of UTF-7 is that, like ISO-2022, it is a modal encoding.
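The practical cost of a modal encoding can be seen by encoding the same character under several of the schemes above. A minimal sketch using Python's standard library codecs:

```python
# Encode the character 漢 (U+6F22) under several of the schemes above.
# ISO-2022-JP, being modal, must wrap the two JIS bytes in escape
# sequences that switch character sets, so its output is the longest.
for codec in ('utf-8', 'euc_jp', 'shift_jis', 'iso2022_jp'):
    data = '漢'.encode(codec)
    print(f'{codec:12} {len(data)} bytes: {data!r}')
```

The non-modal encodings need only two or three bytes for the character itself, while ISO-2022-JP spends six additional bytes on the escape sequences entering and leaving the two-byte JIS character set.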
Modal encodings use somewhat more space than a comparable non-modal encoding, especially in the case of text that switches between differently escaped characters. UTF-8, on the other hand, is a non-modal encoding that uses a variable number of bytes per character.

2.2 Existing CJK Input Methods

Over the years, a number of systems have been developed for inputting Asian languages. These input systems range from complex arrays of characters on a pad with thousands of keys, to chording-style combinational input, to simply typing the roman transliteration. The most basic division between different styles of input techniques is between direct and indirect methods of character input. Direct methods of input include a keyboard with one key for each character, which would require thousands of keys for Japanese. The standard English QWERTY keyboard is another example of direct input: each key press (when inputting English text) corresponds to exactly one character. Some modal information may also be inserted first, such as a shift or control key, but when the actual character is input, there is no ambiguity about which character is being inserted. For languages using Chinese characters, creating a keyboard with this sort of hardware-supported direct input is quite impractical. Therefore, the more common form of direct input is implemented in software: the user types the index number of the character that he wishes to display. This yields exactly one match, as the index is the unique identifier for each character. For example, to input 漢字 using direct-input Unicode, the user would type

  6F22 5B57

where 6F22 is the Unicode value for the character 漢, and 5B57 is the Unicode value for the character 字. Using direct input requires that the user either remember, or look up in a table, every character that he wishes to input.
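Direct input by index amounts to converting a code-point number into the one character it identifies, with no candidate list involved. A quick Python sketch:

```python
# Direct input: each Unicode code point maps to exactly one character,
# so there is no ambiguity to resolve.
print(chr(0x6F22))               # 漢
print(chr(0x5B57))               # 字
print(chr(0x6F22) + chr(0x5B57))  # 漢字
```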
The task is complicated by the fact that these index values are assigned arbitrarily. For this reason, the direct-input method is usually used only as a last resort.

Indirect input techniques will, for many inputs, leave some ambiguity about which character the user actually wants. Most implementations of these techniques rely on hashing the user's input. For Chinese input, a variety of options exist. One method has the user type the romanized reading of the desired character using the Pinyin romanization scheme. However, due to the number of homonyms present in the Chinese language (Huang, p. 61), a number of collisions are possible for a given input. In order to resolve the collisions, the user must pick from the list of characters with the reading that was input. Examples of this process are shown below.

The other technique used for indirect input is to have the user input a sequence of characters that are combined to form the desired Chinese character. For Chinese, this might involve typing a sequence of 3-4 keys, each representing a given shape. The order in which these are entered determines the hashed value, and therefore the set of matches that might be acquired by this input technique.

Input in the Japanese language is, by the most common techniques, a combination of a direct technique with an indirect one. Because the hiragana syllabary is limited in size, and the romanization schemes used for input contain no collisions, the input of hiragana is really quite simple: the user inputs the romanization of the character desired. For example, 'ha' would yield the hiragana は. At this point, the transition from a direct to an indirect encoding takes place. This input is hashed, which leads to the frequent occurrence of collisions, and therefore to a set of results.
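The direct, collision-free romanization stage can be sketched as a simple greedy table lookup. The mapping table below is illustrative and covers only the syllables used in the surrounding examples; a real input method would carry the full romanization table:

```python
# Minimal sketch of the direct romaji-to-hiragana stage.
# The table is deliberately tiny -- just the syllables needed here.
ROMAJI_TO_HIRAGANA = {
    'ha': 'は', 'ra': 'ら', 'ju': 'じゅ', 'ku': 'く',
}

def romaji_to_hiragana(text):
    """Greedily convert romaji to hiragana, longest syllable first."""
    out = []
    i = 0
    while i < len(text):
        for length in (3, 2, 1):   # try the longest match first
            syllable = text[i:i + length]
            if syllable in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[syllable])
                i += length
                break
        else:
            raise ValueError('no syllable match at %r' % text[i:])
    return ''.join(out)

print(romaji_to_hiragana('harajuku'))  # はらじゅく
```

Because the romanization scheme is collision-free, this stage needs no candidate selection; ambiguity only enters at the next stage, when the kana string is looked up against the kanji dictionary.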
For purposes of input, it makes sense to add the katakana syllables to the list of kanji characters; the reason for this should become apparent shortly. After typing the character は, a list of possible characters will appear on the screen: 葉, 歯, 派, 刃, 覇, ... and the katakana ハ.

Another example:

  Input: 'harajuku' -> 'はらじゅく'

The string 'はらじゅく' is now hashed, and the results are 原宿, ハラジュク, or of course the input itself, はらじゅく. The user now selects from this list the character that matches what he desired to input. Many systems do not keep a static ordering of the characters in the hash-collision list, but rather reorder them based on the frequency of selection: each time a character is selected, it moves up the list. Whether a character moves immediately to the top of the list upon being selected, or receives a smaller promotion, is an implementation-specific detail with which we are not concerned.

The most commonly used input technique for Japanese is referred to as transliteration, or inputting a character's sound in a different alphabet. For example, to input 日本語, the user types 'nihongo'. Most Japanese keyboards also include ふりがな characters on the keyboard, which allows the user to type the actual Japanese character desired. However, in order to input a 漢字, transliteration comes into play once again: the user first types the desired ふりがな characters, then, after completing the word, presses the space bar, and a list is displayed containing the possible 漢字 compounds that can be read according to the ふりがな input by the user.

Another approach to inputting Chinese characters is to enter a sequence of keystrokes, each of which represents a certain sub-character. The sub-characters are then used to convert the input to a larger, complete Chinese character. The order in which the components are typed is important, and there is a limit on the number of keys that can be used to represent a character.
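The component-sequence lookup can be sketched as an ordered-key table. The key assignments below are invented purely for illustration; real systems such as Cangjie or Wubi define standardized component-to-key tables:

```python
# Sketch of component-sequence (shape-based) lookup. The key sequences
# here are hypothetical, not taken from any real input method.
COMPONENT_TABLE = {
    'JMU': ['字'],   # hypothetical sequence for 宀 over 子
    'EQ':  ['汉'],   # hypothetical sequence for 氵 beside 又
}

def lookup(keys):
    """Return candidate characters for an ordered key sequence.

    Order matters: 'EQ' and 'QE' are distinct entries, and a sequence
    longer than the table's key limit simply fails to match.
    """
    return COMPONENT_TABLE.get(keys, [])

print(lookup('JMU'))  # ['字']
```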
This type of input works well for typing documents, but is difficult for a novice in the language, who frequently may remember only some part of the 漢字, but not its pronunciation or meaning. A more intuitive form of input for novice users of Japanese trying to input vaguely known characters would be input based on shape. Of course, a keyboard with all of the 漢字 on it would be far too large to be practical. This is the motivation for a system that allows the user to input the parts of the character he does know, and lets an algorithm suggest what he was looking for. <A thorough background of these systems exists in the Huang book>

2.3 Regular Expressions and Unicode

Regular expressions were developed to be used with ASCII coding. Therefore, using regular expressions with Unicode tends to lead to some unwanted results, and confusion exists regarding how certain macros should behave ("\b", ".", "\w", etc.). A shortcoming of using standard regular expressions for 漢字 searches is that there is no real way to specify less than an entire word. In English, a regular expression like "d.g" would return dig, dug, and dog. Dig and dug are related to each other, but dog certainly is not. Still, a user who remembered only part of the word, namely the 'd' and the 'g', without remembering what was in the middle, would be able to search for, and find, his word. If a user is searching for 家族 but cannot remember either of these (rather complicated) characters, then regular expressions will not be of any use to him. Perhaps he can remember the top of the 家 character, and the left side of the 族 character (方). Yet despite his remembering parts of each "word" used to make up 家族, he is unable to write a regular expression to locate it. My system allows this search to be made using \k(TB<8:*>)\k(LR<方:*>) as the regular expression. (家 is not in my dictionary yet!!!)
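The English "d.g" example above can be verified directly with an ordinary regular-expression engine:

```python
import re

# "d.g" matches any three-letter word starting with 'd' and ending
# with 'g', whether or not the matched words are related in meaning.
words = 'dig dug dog'
print(re.findall(r'\bd.g\b', words))  # ['dig', 'dug', 'dog']
```

The partial-word search works in English because the unknown part of a word is a letter position that "." can stand in for; no analogous wildcard exists for the unknown part of a single 漢字 in standard regular expressions.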