Dynamic Glyph Generation Based on Variable Length Encoding Schema
Yap Cheah Shen, eForth Technology
Glyph & Typesetting Workshop, Kyoto, 29 Nov 2003

Outline of Presentation
Morpheme: Latin vs. Han
Latin text encoding
Missing characters in Chinese text
Solution
Implementation details
• Glyph decomposition database
• Topological conversion of strokes
• Automatic frame calculation
Integrating into existing OS
Other issues

Morpheme: Latin vs. Han
A morpheme is the smallest meaningful unit in a language.
For Latin text, it is the “word”.
For Chinese text, it is the Hanzi (Kanji).
Because morphemes represent real-world ideas, they keep changing over time.
Morphemes form an open set.

Latin Text Encoding
The letters of the alphabet form a fixed set of symbols.
All words can be represented as sequences of letters.
Letters are the ideal encoding units for Latin text, e.g., ASCII.
There is no “missing word” encoding problem.

Missing Characters in Chinese Text
Not all existing Hanzi are encoded.
Hanzi form an open set: theoretically, historically, and practically.
Existing encoding schemes rest on wrong assumptions and designs.
An unending loop: assign code points, update the OS, ship a new font, ship a new input method table.
Industry is happy (users suffer).

Solution-1
Parts (components) as the encoding unit.
日月金木水火土人心手口女艹疒犭
Most characters can be represented by a finite set of basic parts.
Strokes are used to construct rarely used parts (thousands of parts appear only once or twice).

Solution-2
A closed set of basic parts and strokes as the encoding unit.
3 joining operators: horizontal, vertical, and enclosing.
1 shielding operator: for hiding a stroke.
Prefix notation: allows recursive composition.

Solution-3
The ordinary CJK fixed-length encoding scheme uses a numeric value as the character code.
• Input method table
  Converts input keystrokes to character codes.
• Static font file
  Glyph data is pre-designed.
  Glyph data is accessed by character code.
• Text file
  A sequence of character codes.

Solution-4
Additional features of a variable-length encoding CJK environment:
• Input
  Characters can be sorted and filtered by parts.
  Compatible with any existing input method.
• Display
  The font file stores commonly used characters and parts.
  Glyphs are generated on the fly from a glyph descriptive sequence.
• Storage and data exchange
  Compatible with Unicode.
  Ideographic description sequences.

Dynamic Glyph Generator
Input:
• Various types of variable-length descriptive character code sequences:
  構字式 (glyph construction expressions) of Academia Sinica
  組字式 (glyph composition expressions) of CBETA
  Unicode Ideographic Description Characters
Output: display and print
• TrueType-compatible outline
• Rasterized bitmap
• Macromedia Flash, SVG
The task: a layout problem, fitting a one-dimensional sequence into a two-dimensional square.

Implementation-1
The system consists of 3 major parts:
Glyph decomposition database
• Courtesy of Prof. Hsieh of Academia Sinica, Taiwan
  http://www.sinica.edu.tw/~cdp/
Outlines of strokes and components
• Beijing ZhongYi Co., professional outline font vendor
  http://www.zhongyicts.com.cn/
The eForth system: putting everything together, hardware-software co-engineering.

Implementation-2
Glyph decomposition database
• All CJK glyphs defined by Unicode 4.0, 71000+ in total
• 549 basic parts; stroke sequences are preserved
• 3996 parts in total
• Total part frequency: 165122
• Accumulated frequency:
  Top 50: 51389 = 31%
  Top 200: 87381 = 53%
  Top 1000: 129393 = 78%

Implementation-3
Strokes are described as an outline with a skeletal line.
Both the outline and the skeletal line are quadratic Bézier curves.
Outline points are recalculated according to the scaled skeletal line.
Result:
• Stroke data is highly reusable
• Stroke weights are adjustable

Implementation-4
Automatic frame calculation
• An algorithm estimates the complexity of each part to decide the proportion of the part in the resulting glyph.
• 漁: 氵 25%, 魚 70%, roughly.
• 觀: 雚 55%, 見 40%, roughly.
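
As a rough illustration of the frame calculation above, the sketch below splits a frame between the parts of a horizontal or vertical combination in proportion to a per-part complexity weight. It is not the algorithm used in the system, which the slides do not specify. Stroke count is assumed here as a stand-in for complexity, the 5% spacing is an assumption suggested by the example percentages summing to 95%, and the names `Frame` and `split_frame` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Frame:
    """Axis-aligned box inside the glyph square (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def split_frame(frame: Frame, weights: Sequence[float],
                direction: str, gap: float = 0.05) -> List[Frame]:
    """Divide `frame` among parts in proportion to `weights`.

    direction: "h" places parts left to right, "v" top to bottom.
    gap: fraction of the extent reserved as spacing between parts (assumed).
    """
    if direction not in ("h", "v"):
        raise ValueError("direction must be 'h' or 'v'")
    if not weights or min(weights) <= 0:
        raise ValueError("weights must be non-empty and positive")

    extent = frame.w if direction == "h" else frame.h
    n = len(weights)
    spacing = extent * gap / (n - 1) if n > 1 else 0.0
    usable = extent - spacing * (n - 1)
    total = sum(weights)

    frames = []
    offset = 0.0
    for weight in weights:
        size = usable * weight / total
        if direction == "h":
            frames.append(Frame(frame.x + offset, frame.y, size, frame.h))
        else:
            frames.append(Frame(frame.x, frame.y + offset, frame.w, size))
        offset += size + spacing
    return frames


if __name__ == "__main__":
    # Stroke counts used as complexity weights (an assumed proxy).
    strokes = {"氵": 3, "魚": 11}
    unit = Frame(0.0, 0.0, 1.0, 1.0)
    boxes = split_frame(unit, list(strokes.values()), "h")
    for part, box in zip(strokes, boxes):
        print(part, f"{box.w:.0%} of the width, starting at x={box.x:.2f}")
```

A plain stroke-count ratio will not reproduce the slide's figures exactly; a real estimator would presumably also weigh stroke shape and density. For nested parts, `split_frame` could be applied again to each sub-frame.
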
Result:
• Clear glyph descriptive expressions
• Search engine friendly
• Human readable

Integrating into existing OS/GUI
String manipulation library
• Number of characters: -1 for operators, +1 for characters
• Character width
Graphic sub-system
• Drawing a text line (e.g. ExtTextOut)
Text handling widgets
• Awareness of glyph expressions for caret, selection, and delete/backspace

Other Issues
Quality of the glyph
• Trade-off with space: more part outlines, better quality.
Speed of generation
• No problem for the IBM PC; glyph generation is rare.
• For handheld devices, hardware acceleration is recommended.

Examples
⿱  Vertical combination
⿰  Horizontal combination
⿴  Enclosing
-  Hide
盟 = ⿱明皿 or ⿱⿰日月皿
李世民 = 民-5 (hide the 5th stroke)
玄燁 = 玄-5
丘-4 = U+20009

Thank You
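
The prefix notation and the character-counting rule from the integration slide can be sketched as follows. This is a minimal illustration, not the system's implementation: it handles only the three binary joining operators listed in the examples (⿰, ⿱, ⿴), leaves out the hide operator, and the names `Join`, `parse_one`, `parse_text`, and `count_characters` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# The three joining operators; each takes exactly two operands.
JOIN_OPERATORS = {"⿰": "horizontal", "⿱": "vertical", "⿴": "enclosing"}


@dataclass(frozen=True)
class Join:
    """One joining operator applied to two sub-expressions."""
    op: str
    first: "Node"
    second: "Node"


Node = Union[str, Join]


def parse_one(text: str, pos: int = 0) -> Tuple[Node, int]:
    """Parse one prefix expression starting at `pos`; return (node, next position)."""
    if pos >= len(text):
        raise ValueError("expression ended before all operands were read")
    ch = text[pos]
    if ch in JOIN_OPERATORS:
        first, pos = parse_one(text, pos + 1)
        second, pos = parse_one(text, pos)
        return Join(ch, first, second), pos
    return ch, pos + 1  # an ordinary character or basic part


def parse_text(text: str) -> List[Node]:
    """Split a string into its displayed characters (plain or composed)."""
    nodes = []
    pos = 0
    while pos < len(text):
        node, pos = parse_one(text, pos)
        nodes.append(node)
    return nodes


def count_characters(text: str) -> int:
    """Apply the rule: -1 for each operator, +1 for each character."""
    return sum(-1 if ch in JOIN_OPERATORS else 1 for ch in text)


if __name__ == "__main__":
    # 日 and 月 side by side, stacked above 皿, preceded by a plain character.
    sample = "天⿱⿰日月皿"
    print(count_characters(sample), len(parse_text(sample)))
```

For well-formed input the counting rule agrees with the number of parsed expressions, because every binary operator consumes two operands and yields one glyph. That lets a string library report length without building the tree, while caret movement and deletion in a text widget would need the parsed boundaries.
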