Tafseer Ahmed
Department of Computer Science
University of Karachi

Types of Urdu Software Development
  Word Processing: word processors, ActiveX controls (text box, list, buttons, menus)
  Information Processing: character encoding, database operations (sort, search, comparison)
  Language Processing: grammar checkers, translators, speech synthesizers and recognizers, OCRs

Character, Script, Glyph and Font
  Character: A character is an abstract entity, such as "LATIN CAPITAL LETTER A" or "ARABIC LETTER HAH". Every character has exactly one position (code point) in a character representation scheme such as Unicode.
  Script: A script is the writing system used for a language. For example, English and French are written in Roman script, and Urdu and Farsi are written in Arabic script.
  Glyph: The visual representation of a character on screen or paper is called a glyph. A character can have more than one glyph.

Character Encoding
  Data, and hence text, is stored in a computer as binary numbers.
  A character encoding scheme such as ASCII or EBCDIC maps (English) characters to binary numbers for storage and processing.
  The characters of any language can be given a character encoding. This is the basis of code pages.
  Each language has a code page that holds the encoding of that language's characters.

Character Encoding of Urdu
  Proprietary standards (the biggest problem in Urdu software development)
  Urdu Zabta Takhti (national standard code page of Urdu)
  Unicode (international standard for multilingual characters)

Urdu Zabta Takhti

Unicode
  Unicode is a repository of the characters of almost all languages of the world.
  Unicode has more than 65,000 code points for characters.
  All software vendors are now supporting or switching to Unicode.
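The distinction between a character, its code point, and its stored bytes can be illustrated with a minimal Python sketch. The helper name `describe` and the choice of sample word are assumptions made for this illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The word "Urdu" written in Arabic script: alef, reh, dal, waw.
word = "\u0627\u0631\u062F\u0648"

for code_point, name in describe(word):
    print(code_point, name)
# U+0627 ARABIC LETTER ALEF
# U+0631 ARABIC LETTER REH
# U+062F ARABIC LETTER DAL
# U+0648 ARABIC LETTER WAW

# The same four abstract characters stored under two encodings.
utf16_bytes = word.encode("utf-16-be")
utf8_bytes = word.encode("utf-8")
print(utf16_bytes.hex(" "))  # 06 27 06 31 06 2f 06 48
print(len(utf16_bytes), len(utf8_bytes))  # 8 8
```

Each character has one code point regardless of how it is drawn; the byte sequence depends only on the encoding chosen for storage.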
[Figure: Unicode / ISO 10646, 16-bit international character encoding. Code space from 0x0000 to 0xFFFF: ASCII, Latin, Greek, Arabic and Hebrew, Indian scripts, Thai, punctuation, symbols, Kana, Hangul, ideographs (Hanzi, Kanji, Hanja), future use, private use, compatibility. Sample code points: 0041 (A), 9662, FF96, 4F85, 0000 (null). Windows 2000 uses Unicode version 2.0.]

Approaches to Urdu Fonts
  Naskh (character based)
  Nastaleeq (character based)
  Nastaleeq (ligature based)

Major Problems in Developing Urdu Fonts
  Many glyphs correspond to one character.
  Only 256 positions are available in a font file, so all ligatures cannot be stored in a single file.
  A special-purpose Urdu word processor is required to implement glyph joining and substitution logic.

OpenType Font (OTF)

TrueType Font
  The TrueType font technology consists of two components: the TrueType font and the TrueType rasterizer.
  One glyph corresponds to one code point (position) in the font file.

OpenType Font
  OpenType is a new cross-platform font file format developed jointly by Adobe and Microsoft.
  It is an extension of the TrueType font format.
  An OpenType font may contain more than 65,000 glyphs.
  One character may correspond to several glyphs.
  A rich mapping between characters and glyphs supports ligatures, positional forms, alternates, and other substitutions.
  It carries information to support two-dimensional positioning and glyph attachment.
  It carries explicit script and language information, so a text-processing application can adjust its behavior accordingly.

Tables in an OTF Font
  CMAP (Character to Glyph Mapping)
  GDEF (Glyph Definition Data)
  GPOS (Glyph Position Data)
  GSUB (Glyph Substitution Data)
  BASE (Baseline Data)
  JSTF (Justification Data)

GDEF Table
  Glyph Class Definition: Simple, Ligature, Combining Mark, Component
  Attachment Point List: glyph attachment points defined in GPOS
  Ligature Caret List Table

Traditional Urdu Word Processor

OpenType-enabled Word Processor

CMAP
  Maps character code points to the glyph IDs (GIDs) of glyphs.
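A rough sketch of the character-to-glyph mapping described above: text is first mapped to glyph IDs, one per character, and a ligature rule then replaces a glyph sequence with a single glyph. This is a toy model in plain Python, not a font parser. The glyph IDs, the tables `TOY_CMAP` and `TOY_LIGATURES`, and both function names are hypothetical; a real font stores this data in its binary tables.

```python
# Toy character-to-glyph map. All glyph IDs are made up for illustration.
TOY_CMAP = {
    0x0644: 10,  # ARABIC LETTER LAM
    0x0627: 11,  # ARABIC LETTER ALEF
    0x0628: 12,  # ARABIC LETTER BEH
}
NOTDEF_GID = 0  # by OpenType convention, glyph 0 is the "missing glyph"

# Toy ligature rule: a pair of glyph IDs -> one ligature glyph ID (hypothetical).
TOY_LIGATURES = {(10, 11): 50}  # lam + alef -> lam-alef ligature

def map_to_glyphs(text, cmap=TOY_CMAP):
    """Character-to-glyph step: one glyph ID per character."""
    return [cmap.get(ord(ch), NOTDEF_GID) for ch in text]

def apply_ligatures(gids, ligatures=TOY_LIGATURES):
    """Ligature step: replace each matching pair of glyphs with one glyph."""
    out = []
    i = 0
    while i < len(gids):
        pair = tuple(gids[i:i + 2])
        if pair in ligatures:
            out.append(ligatures[pair])
            i += 2  # both input glyphs are consumed by the ligature
        else:
            out.append(gids[i])
            i += 1
    return out

gids = map_to_glyphs("\u0628\u0644\u0627")  # beh, lam, alef
print(gids)                   # [12, 10, 11]
print(apply_ligatures(gids))  # [12, 50]
```

Because the glyph count after substitution differs from the character count, the word processor no longer needs its own joining logic; the rules travel with the font.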
GSUB
  Contains information for substituting glyphs to render the scripts and language systems supported in a font.

Types of Substitution
  A single substitution replaces a single glyph with another single glyph.
  An alternate substitution identifies functionally equivalent but different-looking forms of a glyph.
  A multiple substitution replaces a single glyph with more than one glyph. This is used to specify actions such as ligature decomposition.
  A ligature substitution replaces several glyph indices with a single glyph index.
  A contextual substitution describes glyph substitutions in context, that is, a substitution of one or more glyphs within a certain pattern of glyphs. Each substitution describes one or more input glyph sequences and one or more substitutions to be performed on that sequence.

GPOS
  Provides precise control over glyph placement for sophisticated text layout and rendering in different scripts and language systems.
  To render Urdu glyphs properly, a text-processing client must modify both the horizontal and the vertical position of a glyph.
  Entry and exit points

BASE
  Contains information about baseline offsets on a script-by-script basis.

JSTF
  Contains justification information, including whitespace and Kashida adjustments.

VOLT
  Visual OpenType Layout Tool
  Developed by Microsoft

Glyph Grid
  Glyph Name
  Glyph Type
  Glyph ID
  Unicode
  Components

Glyph Group
  Glyph Name
  Glyph Group
  Glyph Range
  Glyph Enumeration

Substitution Tool
  Lookup Name
  Lookup Type
  Process Marks
  Process Base Glyph
  Text Flow

Positioning Tool
  Lookup Header
  Lookup Type
  Glyph Positioner
  Glyph Adjustment (Single, Pair, Anchor)
  Cursive Attachment
  Caret Positioning

Urdu Support in Software
  Windows XP
  Windows 2000
  Office 2000 and XP
  Internet Explorer 5.5
  Visual Studio
  Java

Urdu Support in Windows 2000
  Input locale (currency, date)
  Keyboard
  Write Urdu anywhere (Notepad, Windows Explorer)
  RTL (right-to-left) controls, including windows and text boxes

Issues in Urdu Databases
  Urdu characters are not arranged in alphabetical sequence in Unicode.
  A collating sequence is needed for sorting.
  Diacritics (Aarab): sorting and comparison are problematic when text contains diacritics.

Web Resources
  http://www.microsoft.com/typography/developers/opentype/
  http://microsoft.com/globaldev/
  http://communities.msn.com/MicrosoftVOLTuserscommunity/
  http://www.adobe.com/type/opentype/
  www.unicode.org
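As a rough illustration of the sorting issues listed under Issues in Urdu Databases, the sketch below compares a plain code-point sort with a sort that uses an explicit collating sequence and ignores diacritics. The alphabet in `URDU_ORDER` is deliberately partial, and the names `strip_diacritics` and `collation_key` are hypothetical; a production system would more likely rely on a full collation implementation.

```python
import unicodedata

# First few Urdu letters in alphabetical order (partial, for illustration):
# alef, beh, peh, teh, tteh, theh, jeem, tcheh
URDU_ORDER = "\u0627\u0628\u067E\u062A\u0679\u062B\u062C\u0686"
RANK = {ch: i for i, ch in enumerate(URDU_ORDER)}

def strip_diacritics(text):
    """Drop nonspacing marks (category Mn), which include zabar, zer and pesh."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def collation_key(word):
    """Rank letters by alphabet position; unlisted characters sort after them."""
    base = strip_diacritics(word)
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in base]

ta = "\u062A\u0627"  # teh + alef
pa = "\u067E\u0627"  # peh + alef
ba = "\u0628\u0627"  # beh + alef

# Code-point order puts teh (U+062A) before peh (U+067E), which is
# the wrong way round for the Urdu alphabet.
print(sorted([ta, pa, ba]) == [ba, ta, pa])                     # True
print(sorted([ta, pa, ba], key=collation_key) == [ba, pa, ta])  # True

# With a zabar (U+064E) on the first letter, the raw strings differ,
# but their collation keys match.
ba_marked = "\u0628\u064E\u0627"
print(ba_marked == ba)                                # False
print(collation_key(ba_marked) == collation_key(ba))  # True
```

Ignoring diacritics entirely is one possible policy; whether marked and unmarked spellings should compare equal or merely sort adjacently is a design decision for the database.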