1 Introduction

1.1 CJK Scripts

Computers were developed mainly to support Western languages. As computing spread rapidly to non-Western countries, however, new problems arose. A major one is how to handle the Chinese, Japanese, and Korean (CJK) writing systems. How could an input system designed for 52 phonetic symbols deal with literally thousands of ideographic symbols?

Unlike Western languages, CJK languages use both syllabic and ideographic characters. Some of these writing systems, such as Japanese hiragana and katakana, are linear. Others, such as Japanese kanji, are made up of two-dimensional block characters. A more thorough description of block characters appears in section 1.1.4, “Block Characters”. A description of the main writing systems of China, Korea, and Japan, along with brief historical background, appears below.

1.1.1 Chinese Writing Systems

1.1.1.1 Lading Zimu (Latin Characters)

These are simply the Latin characters used in the Chinese language: the uppercase letters, the lowercase letters, and the numerals zero through nine. They appear in places such as mathematical formulas, charts, and transliteration. The most frequently used transliteration systems are Pinyin, used primarily in China, and Wade-Giles, used primarily in Taiwan.

1.1.1.2 Zhuyin

Zhuyin, also commonly referred to as bopomofo, is a system that breaks Chinese characters down into simpler component elements based on their pronunciation. This system is nearly identical in China and Taiwan.

1.1.1.3 Hanzi

Hanzi is the traditional system of Chinese characters. As of 1994, there were approximately 85,000 Hanzi in use in China.1 In China, simplified versions of many characters have come into common use. Taiwan, however, continues to use the traditional, more complex versions of the characters.
1 Page 58, CJKV

1.1.2 Korean Writing Systems

1.1.2.1 Hangul

Hangul is a distinctive writing system whose symbols are built from individual letters called jamo. The resulting hangul characters are used to write Korean words.

1.1.2.2 Hanja

Chinese characters used in Korean are called hanja (written with the same characters as the Japanese word “kanji”). They were borrowed from China as the characters made their way to Japan. In the past, hanja were used quite frequently. In modern Korea, however, most writing is done in Hangul.

1.1.3 Japanese Writing Systems

1.1.3.1 Hiragana

Over the years, a subset began to form within the writing system imported from China. This set of characters was a greatly simplified version of some commonly used Chinese characters. In a break from the original Chinese writing system, these extremely simplified characters were used to represent the syllables that make up the Japanese language. This was a marked departure from the ideographic system of writing that had originally been imported from the Chinese mainland.

This is not to say that the Japanese writing system stopped using ideographic characters. Rather, the writing system consists of a mix of hiragana and kanji (Chinese characters). For instance, これはひらがなと漢字があります, “This has hiragana and kanji”, is a sentence that uses both scripts. The result is a very complicated system in which a given character can have multiple possible phonetic readings. For example, hi, ni, and nichi are all readings of the character 日.

1.1.3.2 Katakana

Katakana is similar to hiragana in some ways. Like hiragana, katakana is a syllabary, and each character in the hiragana syllabary has a corresponding katakana character. Many of the katakana closely resemble their corresponding hiragana. In fact, most of these similar characters are derived from the same kanji.
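The one-to-one correspondence between the two syllabaries can be sketched in a few lines of Python. This is only an illustration, not part of the technique developed in this document: it relies on the fact that Unicode places the basic katakana exactly 0x60 code points above the corresponding hiragana, and the function names are hypothetical.

```python
# Assumption: only the basic kana ranges are handled. Any other character,
# including kanji and the long-vowel mark ー, passes through unchanged.
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096  # ぁ .. ゖ
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6  # ァ .. ヶ
KANA_OFFSET = KATAKANA_FIRST - HIRAGANA_FIRST   # 0x60


def hiragana_to_katakana(text):
    """Replace each hiragana with its corresponding katakana."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST
        else ch
        for ch in text
    )


def katakana_to_hiragana(text):
    """Replace each katakana with its corresponding hiragana."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch
        for ch in text
    )


print(hiragana_to_katakana("かたかな"))  # カタカナ
print(katakana_to_hiragana("カタカナ"))  # かたかな
```

The mapping works character by character precisely because both syllabaries are linear scripts; no comparable arithmetic relates kanji to one another.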
The most common uses of katakana are for foreign loan words, names of those who do not have a Chinese character name, onomatopoeic words, and emphasis. For example, “Scott” is written as スコット rather than すこっと. Also, the word __________ is an example of onomatopoeia. Japanese words are sometimes written in katakana to show emphasis. An example might be イヤダ! Many of the foreign loan words in Japanese that are not of Chinese origin come from English. Words like バスケットボール (basketball) and テレビ (television) are common examples of this phenomenon.

1.1.3.3 Kanji

Chinese characters, or 漢字, were borrowed extensively by Japan up until 1300.2 Most of these characters came into Japan over three distinct periods, in the form of kanji compounds. Compounds are words made up of more than one character. Often, when compounds borrowed in different eras contained the same character, that character would be pronounced differently in the different compounds.

2 Page 59, CJKV

Another character type is the set of 国字 (こくじ), the Chinese-style characters created in Japan. Examples include 働 (はたらく) and 枠 (わく). This class of characters arose to create needed new words at a time when Japan was in isolation and therefore unable to borrow a character.

The characters borrowed from Chinese have readings that can be classified as one of two types: 訓読み (くんよみ, kunyomi) and 音読み (おんよみ, onyomi). The kunyomi is the native Japanese reading; these readings predate the first wide-scale integration of Chinese characters. The onyomi is the reading that came from Chinese. A character can have more than one onyomi.

1.1.3.4 Latin (European)

A system of romanization exists for the Japanese writing system. It uses the standard Western alphabet to transliterate the sounds of words written in katakana, hiragana, and kanji.
One of the more widely used systems is called the Hepburn system.

か = ka
カ = ka
家 = ka

Figure XXX

Figure XXX shows an example of hiragana, katakana, and kanji characters, all of which can be transliterated as “ka”.

In modern Japan, the Latin alphabet is also frequently used for purposes other than transliteration. One case is multinational corporations that wish to do business overseas; such a company will usually write its name in Latin characters. Some examples are Sony and Sanyo. Popular culture also frequently uses Latin script and English words in advertising and other media.

1.1.4 Block Characters

Korean Hangul, as well as the Chinese characters used in Japanese, Korean, and Chinese, are all examples of block characters. To see what is meant by “block characters”, consider the following example. The Japanese character 口 by itself is pronounced くち and means “mouth”. This character is also a component of a number of other 漢字, for example 言, 味, 語, 裕, 邑, and 右, to name just a few. This ability of one character to appear in various positions within another character is what makes these block (two-dimensional) characters. Besides stand-alone characters such as 口, there are also components called “radicals” that can appear in various places in kanji. In Korean, the block structure is more direct: each Hangul character is a block assembled from individual jamo (section 1.1.2.1). For example, 한 is built from ㅎ, ㅏ, and ㄴ.

1.1.5 Problems with Existing Techniques

Despite the two-dimensional characteristics of these block character writing systems, there is currently no viable system in place that takes advantage of the two-dimensional nature of the characters. Use of a standard keyboard requires that characters be input serially. However, the serialization schemes in use, while efficient for typing, offer little if any insight into the composition of the characters. Instead, they are simply hashing functions.
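The hashing behaviour can be illustrated with a small Python sketch. The table below is a hypothetical, hand-made stand-in for a reading-based input dictionary, not any real input method's data. It uses the readings mentioned earlier in this chapter plus one assumed entry (味 read as “aji”).

```python
from collections import defaultdict

# Hypothetical mini-dictionary: character -> romanized reading typed by the user.
READINGS = {
    "か": "ka",
    "カ": "ka",
    "家": "ka",
    "口": "kuchi",
    "味": "aji",  # assumed entry; 味 contains 口 as a component
}


def build_input_table(readings):
    """Group characters by typed key, the way a hash table groups by hash."""
    table = defaultdict(list)
    for char, key in readings.items():
        table[key].append(char)
    return dict(table)


table = build_input_table(READINGS)
print(table["ka"])     # ['か', 'カ', '家']
print(table["kuchi"])  # ['口']
print(table["aji"])    # ['味']
```

Three visually unrelated characters collide on the key “ka”, while 口 and 味, which share a component, are filed under keys that have nothing in common. The key identifies a bucket of candidates but carries no information about how a character is composed.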
In order to exploit the information built into the two-dimensional form of the characters, another serialization technique is needed. Section 1.2 describes my serialization technique. For regular typing, an extended serialization technique that reveals structure would not be especially efficient, and it is not intended to replace the hashing methods there. Instead, it is meant to provide an alternate means of searching for text in a document, or of determining which character to use, without requiring the ability to form the entire desired character.

1.2 Approach: Extended Regular Expressions

The need, then, is for a technique that accurately describes these two-dimensional writing systems in what is an inherently one-dimensional, serial medium. Once this serialization scheme is in place, it seems only logical to extend it so that text searches on Asian languages have the same freedom available for Western-language text. Serializing the characters allows them to be treated as non-atomic units. This is the key to the new approach.
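The idea of searching over non-atomic characters can be sketched for Hangul using only the Python standard library. This is not the serialization scheme proposed in this document. As a stand-in, it uses Unicode canonical decomposition (NFD), which expands a precomposed Hangul syllable into its sequence of conjoining jamo, and it covers Hangul only. The function names are hypothetical.

```python
import re
import unicodedata


def decompose(char):
    """Serialize one character; a Hangul syllable becomes its jamo sequence."""
    return unicodedata.normalize("NFD", char)


def find_characters(text, jamo_pattern):
    """Return the characters of `text` whose serialized form fully matches
    the regular expression `jamo_pattern`."""
    regex = re.compile(jamo_pattern)
    return [ch for ch in text if regex.fullmatch(decompose(ch))]


INITIAL_HIEUH = "\u1112"  # conjoining initial consonant, displayed as ㅎ
VOWEL_A = "\u1161"        # conjoining vowel, displayed as ㅏ

text = "한글 하나"
# Any syllable block that begins with ㅎ, whatever follows it.
print(find_characters(text, INITIAL_HIEUH + ".*"))       # ['한', '하']
# Any syllable block whose vowel is ㅏ, with an optional final consonant.
print(find_characters(text, "." + VOWEL_A + ".?"))       # ['한', '하', '나']
```

Neither query is expressible when each syllable is an opaque code point. Once the character is serialized into its parts, ordinary regular-expression operators such as `.`, `?`, and `*` apply to the components of a single character.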