Text Editing and Display

Multilingual Editing using
RichEdit 4+
Hon Wah Chan
Murray Sargent III
Microsoft Corporation
Text Services Group, Word
Introduction
RichEdit is a text engine with a hierarchy of presentation formats
Features such as automatic choice of fonts, rich text, 2D text objects
Handling non-Unicode documents in Unicode text engines
Describe interfaces and component usage
Ways to input Unicode text using IMEs, speech
Demo

What’s RichEdit?

RichEdit 4.x is a set of plain/rich-text, single/multiline Unicode/ANSI edit controls and combo/listboxes in a single worldwide binary

Multilevel undo, message and COM interfaces, Word compatibility, pretty rich text

Outline view, zoom, font binding, latest in IME support, and rich complex-script support (BiDi, Indic, and Thai)
Clients include
Outlook mail note, post-it
Most Office dialogs
All OSes since Win98: Wordpad, Charmap
Darwin installer
WebCalc
Project
Visual Studio, DaVinci
Publisher
Front Page
Handheld PC PocketWord
eBooks
OE (for mail header)
Borland’s Delphi
SQL Server dev tools, RAID
MSN Companion chat
Via Win2k Wrapper – cc:mail, WebEditPro, Eudora, Encarta, Money(US), Sibelius, Borland TRichedit class, apps created with VB, MFC…
Some Fancier Features
Features added for ebooks: pagination, hyphenation, kerning, ClearType support, text wrap around embedded objects
Multilevel tables
Autocorrect
AutoURL detection (improved from 3.0)
2D Text Objects




RichEdit 4.5 (in development) supports WYSIWYG editing of many 2D objects
Ruby, Tatenakayoko, Warichu, Kumimoji
Math: fractions, autosizing brackets, boxes, matrices, integrals
Demo will show some of these features
Backward Compatibility







Unicode text engines need to import/export text in other character sets
Given non-Unicode plain text, which code page should one use to convert to/from Unicode?
On localized systems, the system code page is a good bet
In multilingual text, you can enter text using keyboards in a variety of languages that need either Unicode or multiple code pages
For searching text, the best choice seems to be the current keyboard code page
If text begins with a BOM, it’s Unicode
If text begins with a rich-text header, e.g., “{\rtf” or “<html>”, use the appropriate conversion routine
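The detection order above (BOM first, then rich-text header, then a fallback code page) can be sketched roughly as follows. This is an illustration only, not RichEdit's code: the function names, the 64-byte sniffing window, and the `cp1252` default standing in for "system code page" are all assumptions.

```python
import codecs

# BOMs are checked longest-first so that UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_format(data, fallback_codepage="cp1252"):
    """Return (kind, encoding, bom_length) for raw input bytes.

    kind is "unicode", "rtf", "html", or "plain". For rich-text kinds the
    encoding is None because the format's own converter decides.
    fallback_codepage is a hypothetical stand-in for the system code page.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return ("unicode", name, len(bom))
    head = data[:64].lstrip().lower()
    if head.startswith(b"{\\rtf"):
        return ("rtf", None, 0)
    if head.startswith(b"<html"):
        return ("html", None, 0)
    return ("plain", fallback_codepage, 0)


def decode_plain_text(data, fallback_codepage="cp1252"):
    """Decode plain text to a Unicode string; rich text is rejected."""
    kind, encoding, skip = sniff_format(data, fallback_codepage)
    if kind in ("rtf", "html"):
        raise ValueError(kind + " input needs its own conversion routine")
    return data[skip:].decode(encoding, errors="replace")
```

A caller that knows the current keyboard code page could pass it as `fallback_codepage` instead of the system default.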
Backward Compatibility (cont)





Need a little rich-text functionality to display Unicode plain text unambiguously in some CJK scenarios
This functionality handles font choices and language-dependent glyph variants
When a user types in text using a keyboard charset, the edit engine knows the charset and can therefore insert accurate Unicode text, including which CJK glyph variant to use
Client gets text as pure ANSI (or Unicode) text without script clues
Would be handy to have script tags
Complex Scripts
Unicode covers many complex scripts, e.g., Arabic, Indic, Thai, ancient Korean
Complex scripts require a layout engine that translates character codes to glyph indices (often referencing ligatures)
RichEdit uses Uniscribe and the MS line-layout component for complex scripts
Font Binding








Most Unicode characters belong to scripts
Associate with each position in a document a “font bundle”
When inserting characters, assign each one to a script
For CJK, check surrounding characters for Kana and Hangul as clues to use Japanese or Korean fonts instead of Chinese
Assign scripts to neutrals and digits
Keyboard language, especially IMEs, provides strong binding clues
Format inserted characters with fonts assigned to scripts. Check current font to see if it supports the required script
RichEdit 4.0 has 50 scripts for Unicode 3.1. Client can specify what default font to use for a given script.
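The binding steps above might look roughly like this in miniature. It is a toy sketch, not RichEdit's algorithm: it knows only four scripts instead of 50, the code-point ranges are deliberately coarse, and the `window` size, the `keyboard_hint` parameter, and the script-to-font table are illustrative assumptions.

```python
from itertools import groupby

# Hypothetical default fonts per script; a client could supply its own table.
DEFAULT_FONTS = {
    "latin": "Times New Roman",
    "japanese": "MS Mincho",
    "korean": "Batang",
    "chinese": "SimSun",
}


def base_script(ch):
    """Coarse per-character classification (a tiny subset of real script data)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    if cp < 0x250 and ch.isalpha():
        return "latin"
    return "neutral"  # digits, spaces, punctuation, anything else in this sketch


def assign_scripts(text, keyboard_hint="chinese", window=8):
    """Assign one script per character, using neighbors as CJK clues."""
    raw = [base_script(c) for c in text]
    out = []
    for i, s in enumerate(raw):
        if s == "han":
            nearby = raw[max(0, i - window): i + window + 1]
            if "kana" in nearby:
                s = "japanese"
            elif "hangul" in nearby:
                s = "korean"
            else:
                s = keyboard_hint  # no clue nearby: trust the keyboard language
        elif s == "kana":
            s = "japanese"
        elif s == "hangul":
            s = "korean"
        out.append(s)
    # Neutrals and digits take the script of the preceding run.
    last = None
    for i, s in enumerate(out):
        if s == "neutral":
            if last is not None:
                out[i] = last
        else:
            last = s
    # Leading neutrals take the first real script, or Latin if there is none.
    first = next((s for s in out if s != "neutral"), "latin")
    return [first if s == "neutral" else s for s in out]


def bind_fonts(text, fonts=DEFAULT_FONTS, **kwargs):
    """Split text into (run, font) pairs, one per run of equal script."""
    runs, pos = [], 0
    for script, group in groupby(assign_scripts(text, **kwargs)):
        n = len(list(group))
        runs.append((text[pos:pos + n], fonts[script]))
        pos += n
    return runs
```

A fuller version would also check whether the current font already supports the required script before switching fonts.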
Language Detection & Font Binding





Korean and Japanese are often easy to spot because of Hangul and Kana characters, respectively
For CJK, can convert back to code page and see if errors occur (Ken Lunde’s suggestion)
For proofing purposes, accurate language identification is needed. For font binding, script identification is usually sufficient
Typically more than one language corresponds to a script, e.g., Latin script. Essentially only one uses the Korean script
Natural language processing techniques allow good language identification if more than a few words are involved, e.g., a sentence
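The code-page round-trip idea can be tried with Python's built-in codecs, as in the sketch below. The pairing of languages with Windows code pages and the function name are assumptions for illustration; the slide does not say how RichEdit implements the check.

```python
# Windows code pages commonly associated with each CJK language (assumed pairing).
CJK_CODEPAGES = [
    ("japanese", "cp932"),
    ("korean", "cp949"),
    ("simplified chinese", "cp936"),
    ("traditional chinese", "cp950"),
]


def candidate_languages(text):
    """Return the languages whose code page can represent every character."""
    candidates = []
    for language, codepage in CJK_CODEPAGES:
        try:
            text.encode(codepage)
        except UnicodeEncodeError:
            continue  # conversion error: this code page is ruled out
        candidates.append(language)
    return candidates
```

The test only rules languages out. Several code pages often succeed, and pure ASCII passes all four, so the result is best treated as one clue alongside Kana/Hangul spotting and the keyboard language.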
Font Sizing





In dialogs, 8-pt Latin characters are commonly used
8-pt Chinese characters are hard to read, so it is better to use 9-pt Chinese characters in combination with 8-pt Latin characters
Latin characters have bigger descenders than Chinese characters, since the latter only need room for an underline
Combining 8-pt Latin characters with 9-pt Chinese characters and keeping the same baseline increases line height to 9 pts plus extra height for the Latin descender
Result is more like 10 points, which shifts text too high in a dialog box originally designed to handle one language
Unicode Surrogate Pairs





Using two 16-bit surrogates to represent a single character complicates more than measurement and display of characters:
Arrow-key handlers and other methods that change character position must avoid ending up in between lead and trail surrogates
Input methods need to map to surrogate pairs
Case changes, line-breaking rules, sorting, file formats, and backing-store manipulations in general have to recognize and deal with pairs
Surrogate code ranges make them easy to work with relative to multibyte encoding systems
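The last point can be made concrete. Lead units (0xD800 to 0xDBFF) and trail units (0xDC00 to 0xDFFF) occupy disjoint ranges, so a single code unit shows whether a position falls inside a pair, with no need to scan from the start of the text as some multibyte encodings require. The sketch below uses the standard UTF-16 arithmetic; the function names are illustrative, and the backing store is modeled as a list of 16-bit integers.

```python
def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF


def to_surrogates(code_point):
    """Split a supplementary-plane code point into (lead, trail) code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(lead, trail):
    """Combine a lead/trail pair back into a single code point."""
    if not (is_lead(lead) and is_trail(trail)):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


def snap_to_boundary(units, pos):
    """Move a position that falls between lead and trail back before the pair."""
    if 0 < pos < len(units) and is_trail(units[pos]) and is_lead(units[pos - 1]):
        return pos - 1
    return pos
```

For example, U+1D11E maps to the pair (0xD834, 0xDD1E), and a caret at index 2 of `[0x61, 0xD834, 0xDD1E, 0x62]` snaps back to index 1.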
Nonspacing Combining Marks





Multicode characters (surrogate pairs, CRLFs, combining-mark and variant-tag sequences) require special display/navigation handling
Render combining-mark sequences by standard system calls and fonts that support combining marks. Better display needs a layout engine that talks to OpenType
Simple caret movement across combining-mark sequences prevents stopping inside a sequence. Backspace key deletes one mark at a time
Mouse-cursor hit testing leaves selection at beginning/end of combining-mark sequence (more elegant model allows selection and editing of individual marks)
Cool thing: if you can navigate past CRLF combinations, you can modify corresponding code to handle surrogate pairs and combining-mark sequences quite easily
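One way to read the "cool thing" is that a single predicate, "would a caret here split a multicode unit?", can cover CRLF, surrogate pairs, and combining marks, with caret movement simply skipping positions where it is true. The sketch below is an assumption about how such code could be organized, not RichEdit's implementation. It detects marks through `unicodedata.combining`, which misses marks of combining class 0, and it ignores variant-tag sequences.

```python
import unicodedata


def inside_sequence(text, i):
    """True if a caret at index i would split a multicode unit."""
    if i <= 0 or i >= len(text):
        return False
    prev, cur = text[i - 1], text[i]
    if prev == "\r" and cur == "\n":
        return True  # CRLF
    if 0xD800 <= ord(prev) <= 0xDBFF and 0xDC00 <= ord(cur) <= 0xDFFF:
        return True  # lead/trail surrogate pair stored as separate units
    return unicodedata.combining(cur) != 0  # nonspacing combining mark


def caret_right(text, i):
    i = min(i + 1, len(text))
    while inside_sequence(text, i):
        i += 1
    return i


def caret_left(text, i):
    i = max(i - 1, 0)
    while inside_sequence(text, i):
        i -= 1
    return i


def backspace(text, i):
    """Delete before the caret: marks go one at a time, pairs go whole."""
    if i <= 0:
        return text, 0
    start = i - 1
    # A CRLF or surrogate pair is removed as a unit; a combining mark is not.
    if inside_sequence(text, start) and unicodedata.combining(text[start]) == 0:
        start -= 1
    return text[:start] + text[i:], start
```

With `"e\u0301\u0323x\r\ny"`, a right arrow from index 0 lands at 3 (past both marks), a right arrow from 4 lands at 6 (past the CRLF), and Backspace at 3 removes only the last mark.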
Interfaces




Messages and keyboard
File read/write (plain text or RTF)
TOM (Text Object Model)
ITextServices/ITextHost interfaces
RichEdit Message Interface

System messages





keyboard messages
mouse messages
clipboard messages
Edit messages – RichEdit supports all but four of the system edit messages
RichEdit messages



Character/paragraph formatting
Text input/query
Notification
File Formats





Plain text can be saved/read encoded in any code page, including Unicode and UTF-8
RTF is the principal rich-text format
UTF-8 RTF is used preferentially for cut/copy/paste. Can be used in stream operations
Copying text to/from Word can be a handy way to get desired formatting into a RichEdit instance
HTML is available via system converters
TOM (Text Object Model)




A set of COM dual interfaces that allow Unicode rich/plain text to be manipulated by VB, C/C++, and Java clients.
Access for spelling/grammar checkers
Accessibility
Powerful and efficient text processing primitives.
Embedded scripts
TOM (cont)
ITextDocument: Top-level editing object
ITextStoryRanges: Enumerator for stories in document
ITextRange: Primary text interface: range of text
ITextFont: Character-attribute interface
ITextPara: Paragraph-attribute interface
ITextTag: HTML tag interface
ITextAttributes: Tag-attribute enumerator
ITextSelection: Screen highlighted text range; selection inherits all range methods
ITextServices/ITextHost Interfaces




Windowless interfaces that go beyond message interface
In-place active state – use window of the container
Fewer system resources
Faster activation and deactivation
Other Components used




Uniscribe
MS line-layout component
Windows Text Services Framework
Callbacks for access to word-break, autocorrect, hyphenation, and ClearType libraries
Input methods




Support for the latest IMEs
Speech and handwriting input (Windows Text Services Framework)
Alt-x Unicode input method
Standard hot keys
IMEs






Support Level 2 and Level 3 IMEs
Support Active Input Method Manager (AIMM)
Reconversion - user can convert final string back to composition mode, allowing easy selection of a different candidate string.
Document feed - provides IME with text for current paragraph to increase conversion accuracy during typing.
Mouse operation - gives user better control over candidate and UI windows
Caret position - gets current caret and line info, which IME98 uses to position UI windows (e.g., candidate list).
Windows Text Services Framework





Provide support for Far East input to aware applications on Win32 platforms of any language
Provide a consistent UI for different input methods: speech, handwriting, IME
Coordinated input
Data persistence for dynamic text editing
RichEdit supports both the native mode and Active Input Method Manager (AIMM) mode
Hex to Unicode Input Method







Type Unicode character hexadecimal code
Make corrections as needed
Type Alt+x to convert to character
Type Alt+x to convert back to hex (useful especially for “missing glyph” character)
Resolve ambiguities by selection
Input higher-plane chars using 5- or 6-digit code
MS Word 2002 standard
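The Alt+x toggle can be modeled on a plain string plus a caret index, as sketched below. The exact rules RichEdit and Word apply are not given on the slide, so several choices here are assumptions: the longest-valid-run heuristic, the 6-digit limit on the scan, the minimum of four digits on output, and the function names.

```python
import string

HEX_DIGITS = string.hexdigits  # "0123456789abcdefABCDEF"


def _is_valid(code):
    """Accept any nonzero Unicode scalar value (no surrogate code points)."""
    return 0 < code <= 0x10FFFF and not 0xD800 <= code <= 0xDFFF


def hex_to_char(text, caret, sel_start=None):
    """Replace the hex code ending at caret with its character.

    Without a selection, the longest run of up to 6 hex digits that forms a
    valid code point is used. With sel_start, exactly text[sel_start:caret]
    is used, which is how the user resolves ambiguities.
    Returns (new_text, new_caret); input is returned unchanged on failure.
    """
    if sel_start is not None:
        candidates = [sel_start]
    else:
        start = caret
        while start > 0 and caret - start < 6 and text[start - 1] in HEX_DIGITS:
            start -= 1
        candidates = range(start, caret)
    for s in candidates:
        digits = text[s:caret]
        if not digits or any(c not in HEX_DIGITS for c in digits):
            continue
        code = int(digits, 16)
        if _is_valid(code):
            return text[:s] + chr(code) + text[caret:], s + 1
    return text, caret


def char_to_hex(text, caret):
    """Replace the character before caret with its hex code (4+ digits)."""
    if caret <= 0:
        return text, caret
    digits = format(ord(text[caret - 1]), "04X")
    return text[:caret - 1] + digits + text[caret:], caret - 1 + len(digits)
```

Typing `263A` after `smile ` and converting yields the U+263A character. In `dead263A`, every character is a hex digit, so selecting just `263A` (`sel_start=4`) tells the converter which digits are meant.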
Unicode combobox/listbox





Emulate the system combobox and listbox
Unicode support on all Win32 platforms
Allow mixed languages between items
Modified EM_SETTEXTEX for inserting items
Used in Office applications
Demo
Conclusions







Have described RichEdit, an engine for text display and editing with a hierarchy of presentation formats
Automatic choice of fonts for Unicode plain text, including surrogate-pair characters and combining-mark sequences
Handling non-Unicode documents in Unicode text engines
Described interfaces and component usage
Ways to input Unicode text using IMEs, speech
Clients include many Office and Windows apps
Able to display 2D text objects such as Ruby and Warichu