Chapter 1

1 Introduction
1.1 CJK Scripts
Computers were developed mainly to support Western languages. However, with
the rapid adoption of computing in non-Western countries, certain issues began to arise.
One major issue is how to deal with the Chinese, Japanese, and Korean (CJK) writing
systems. How could an input system designed for 52 phonetic symbols deal with literally
thousands of ideographic symbols?
Unlike Western writing systems, CJK writing systems contain both syllabic and
ideographic characters. Some of these writing systems, such as Japanese hiragana and
katakana, are linear. Others, such as Japanese kanji, are made up of two-dimensional
block characters. A more thorough description of block characters appears
in section 1.1.4, “Block Characters”. A description of the main writing systems of China,
Korea, and Japan, along with some brief historical background, appears below.
1.1.1 Chinese Writing Systems
1.1.1.1 Lading Zimu (Latin Characters)
These characters are simply the set of Latin characters used in the Chinese language.
They comprise the uppercase letters, lowercase letters, and numerals zero through nine.
They are used in places such as mathematical formulas, charts, and transliteration. The
most frequently used systems of transliteration are Pinyin, used primarily in China,
and Wade-Giles, used primarily in Taiwan.
1.1.1.2 Zhuyin
Zhuyin, also commonly referred to as bopomofo, is a system used to break Chinese
characters down into simpler phonetic elements based on their pronunciation. This system
is nearly identical in China and Taiwan.
1.1.1.3 Hanzi
Hanzi is the traditional system of Chinese characters. As of 1994, there were
approximately 85,000 Hanzi in use in China.1 In China, simplified versions of many
characters have come into common use. Taiwan, however, continues to use the traditional,
more complex versions of the characters.
1.1.2 Korean Writing Systems
1.1.2.1 Hangul
The Hangul writing system is a distinctive writing system made up of symbols, each formed
from individual letters. These individual letters are called jamo. The resulting Hangul
characters are used to write words in Korean.
1.1.2.2 Hanja
Chinese characters that are used in Korean are called hanja (written with the same
characters as the Japanese word “kanji”). Korea borrowed these characters from China as
they made their way toward Japan. In the past, hanja were used quite frequently. However,
in modern Korea, most writing is done in Hangul.
1.1.3 Japanese Writing Systems
1.1.3.1 Hiragana
Over the years, a subset began to form within the writing system imported from China.
This set of characters was a greatly simplified version of some commonly used
Chinese characters. In a break from the original style of the Chinese writing
system, these extremely simplified characters were used to represent the set of possible
syllables that make up the Japanese language. This was a marked difference from the
ideographic system of writing that had originally been imported from the Chinese
mainland.
This is not to say that the Japanese writing system stopped using ideographic
characters. Rather, the writing system consists of a mix of hiragana and kanji (Chinese
characters). For instance, これはひらがなと漢字があります (“This has hiragana and
kanji”) is a sentence that uses both scripts. This leads to a very complicated system,
where a given character can have multiple possible phonetic readings. For example, hi,
ni, and nichi are all readings of the character 日.
1 Page 58, CJKV
1.1.3.2 Katakana
Katakana is similar to hiragana in some ways. For example, katakana, like
hiragana, is a syllabary. There is also a corresponding katakana character for each
character found in the hiragana syllabary. Many of the katakana are extremely similar to
their corresponding hiragana characters. In fact, most of these similar characters are
derived from the same kanji. The most common uses of katakana are for foreign loan words,
names of those who do not have a Chinese character name, onomatopoeic words, and
emphasis.
For example, “Scott” is written as スコット rather than すこっと. Also, the
word __________ is an example of onomatopoeia. Japanese words are sometimes
written in katakana to show emphasis. An example might be イヤダ! In Japanese,
many foreign loan words that are not of Chinese origin come from English. Words like
バスケットボール (basketball) or テレビ (television) are common examples of this
phenomenon.
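The one-to-one correspondence between the two syllabaries is also visible in Unicode, where each katakana character sits exactly 0x60 code points after its hiragana counterpart. The following is a minimal sketch of that mapping; the function name `hiragana_to_katakana` is illustrative and not part of the original text.

```python
def hiragana_to_katakana(text):
    """Map each hiragana character to the corresponding katakana character."""
    out = []
    for ch in text:
        # Hiragana letters occupy U+3041..U+3096; the matching katakana
        # letters occupy U+30A1..U+30F6, a constant offset of 0x60.
        if "\u3041" <= ch <= "\u3096":
            out.append(chr(ord(ch) + 0x60))
        else:
            out.append(ch)  # leave kanji, Latin letters, punctuation untouched
    return "".join(out)


print(hiragana_to_katakana("すこっと"))  # スコット
```

Characters outside the hiragana range pass through unchanged, so mixed kana and kanji text can be fed in directly.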
1.1.3.3 Kanji
Japan borrowed Chinese characters, or 漢字, extensively up until 1300.2
Most of these characters came into Japan over three distinct periods, in the form of kanji
compounds. Compounds are multi-syllable words. Often, when compounds
borrowed in different eras contained the same symbol, the symbol would be
pronounced differently in the different compounds.
Another character type is the set of 国字 (こくじ), which is the set of Chinese-style
characters created in Japan. Examples include characters such as 働 (はたらく)
and 枠 (わく). This class of characters came about to create needed new words at a time
when Japan was in isolation, and therefore unable to borrow a character.
2 Page 59, CJKV
The characters borrowed from Chinese have readings that can be classified as one
of two types: 訓読み (くんよみ, kunyomi) and 音読み (おんよみ, onyomi). The kunyomi is
the reading that comes from the native Japanese word. These readings predate the first
wide-scale integration of Chinese characters. The onyomi is the reading
that came from the Chinese reading. It is possible for a character to have more than one
onyomi.
1.1.3.4 Latin (European)
A system of romanization exists for the Japanese writing system. It uses the
standard Western alphabet to transliterate the sounds of words written in katakana,
hiragana, and kanji. One of the more widely used systems is called the Hepburn system.
か = ka
カ = ka
家 = ka
Figure XXX
Figure XXX shows an example of hiragana, katakana, and kanji characters, all of which
can be transliterated to “ka”.
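The figure can be reproduced in part with Python's standard `unicodedata` module, because the Unicode name of each kana character ends in a romanized syllable. This is only a sketch: the helper name `kana_sound` is illustrative, and Unicode character names do not follow the Hepburn system (し, for instance, is named SI rather than "shi"). Kanji names carry no reading at all, so a kanji such as 家 would need a separate reading dictionary.

```python
import unicodedata


def kana_sound(ch):
    """Return the romanized syllable from a kana's Unicode name, or None."""
    name = unicodedata.name(ch, "")
    for prefix in ("HIRAGANA LETTER ", "KATAKANA LETTER "):
        if name.startswith(prefix):
            # e.g. "HIRAGANA LETTER KA" -> "ka"
            return name[len(prefix):].split()[-1].lower()
    return None  # not a kana letter (for example, a kanji)


for ch in "かカ家":
    print(ch, kana_sound(ch))
```

The two kana both yield "ka", while the kanji yields None, which mirrors the point that a kanji's reading cannot be read off the character itself.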
In modern Japan, the Latin alphabet is also frequently used for purposes other
than transliteration. One such use is by multinational corporations that
wish to do business overseas. A company in this situation will usually
write its name in Latin characters. Some examples are Sony and Sanyo. Popular culture
also frequently uses Latin script and English words in advertising and other media.
1.1.4 Block Characters
The Korean system of Hangul, as well as
the use of Chinese characters in Japanese, Korean, and Chinese, are all examples of block
characters. To see what is meant by “block characters”, consider the
following example. The Japanese character 口 by itself is pronounced くち, and means
mouth. This character is also a component of a number of other 漢字, for example 言,
味, 語, 裕, 邑, and 右, to name just a few. This ability of one character to appear in
multiple positions within another character makes it a block character (two-dimensional).
Such characters not only stand alone but also serve as components, called “radicals”,
that can appear in various places in kanji.
In Korean, a system with a more direct block structure is in place: each Hangul
character is a block assembled from individual jamo (see section 1.1.2.1). [More
information about how these are composed, and what makes them block characters.]
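As one concrete illustration of Hangul block structure, Unicode encodes precomposed Hangul syllables so that the jamo making up each block can be recovered by arithmetic. The sketch below follows the Hangul syllable arithmetic defined in the Unicode Standard; it is not the technique described in this chapter, and the function name `decompose_hangul` is illustrative.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable (가)
L_BASE = 0x1100   # first leading consonant jamo
V_BASE = 0x1161   # first vowel jamo
T_BASE = 0x11A7   # one before the first trailing consonant jamo
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 syllable blocks in total


def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable block into its jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    jamo = [chr(L_BASE + lead), chr(V_BASE + vowel)]
    if tail:  # tail index 0 means the block has no trailing consonant
        jamo.append(chr(T_BASE + tail))
    return jamo


print(decompose_hangul("한"))  # three jamo: leading ㅎ, vowel ㅏ, trailing ㄴ
```

Each block therefore carries its internal structure in a form that can be serialized, which is the property the rest of this chapter relies on.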
1.1.5 Problems with Existing Techniques
Despite the two-dimensional characteristics found in these block character writing
systems, there is currently no viable system in place to take advantage of the
two-dimensional nature of the characters. Use of a standard keyboard requires that
characters be input serially. However, the serialization schemes used, while efficient
for typing, offer little if any insight into the composition of the characters. Instead,
they act simply as hashing functions.
In order to exploit the information built into the two-dimensional form of the
characters, another serialization technique is needed. Section 1.2 describes my
serialization technique. In regular typing, an extended serialization technique that reveals
structure would not be especially efficient. This serialization technique is not intended to
replace the hashing methods in those instances. Instead, it is meant to provide an
alternate means of searching for text in a document, or determining which character to
use, without requiring the ability to form the entire character desired.
1.2 Approach: Extended Regular Expressions
The need, then, is to find a technique to accurately describe these two-dimensional
writing systems in what is an inherently one-dimensional system. Also,
once this serialization scheme is in place, it seems only logical to extend it to allow text
searches on Asian languages with the same freedom available for Western-language
text. Serializing the characters allows them to be treated as non-atomic units. This is
the key to the new approach.
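To make the idea of non-atomic characters concrete, the following sketch serializes a few characters into linear component strings and then searches them with ordinary regular expressions. It is an illustration under stated assumptions, not the serialization scheme proposed here: the decomposition table `DECOMP` is hand-written for this example, the layout markers are Unicode Ideographic Description Characters (⿰ for left-to-right, ⿱ for top-to-bottom), and the names `serialize` and `find_chars` are hypothetical.

```python
import re

# Hand-written decompositions for illustration only (assumed data).
DECOMP = {
    "味": "⿰口未",
    "吾": "⿱五口",
    "語": "⿰言吾",
    "邑": "⿱口巴",
    "明": "⿰日月",
}


def serialize(ch, table=DECOMP):
    """Recursively expand a character into a linear string of components."""
    if ch not in table:
        return ch  # layout markers and undecomposed components stay as-is
    return "".join(serialize(part, table) for part in table[ch])


def find_chars(pattern, text, table=DECOMP):
    """Return the characters of text whose serialized form matches pattern."""
    rx = re.compile(pattern)
    return [ch for ch in text if rx.search(serialize(ch, table))]


print(serialize("語"))                 # ⿰言⿱五口
print(find_chars("口", "味語明邑"))     # ['味', '語', '邑']
print(find_chars("^⿰口", "味語明邑"))   # ['味']
```

The first query finds every character containing 口 anywhere in its structure, while the second restricts the match to characters with 口 on the left, something a search over atomic characters cannot express.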