Migration of a 4GL and Relational Database to Unicode
Tex Texin
International Product Manager
© 1998, Progress Software Corporation

Presentation Goals
• Outline Migration Steps
• Describe Design Considerations
• Leverage Existing Double-byte Implementation
• Describe Impact on 4GL and Report Formats

PROGRESS Application Development Suite
• Powerful tools for the rapid creation of distributed business applications
• Creates character, GUI, or web-based clients with common source
• Host-based, client-server, or n-tier distribution on a variety of platforms
• Scalable, robust, and open RDBMS
• International, double-byte enabled

Possible Configuration Options
[Diagram: GUI, web-based, and character clients; client-server, host-based, and optional n-tier configurations; application server; database server with Progress database or other database]

Why do our customers need Unicode?
• Many do not... However,
• Multinationals deploy across regions with incompatible character sets, yet they must share data between them.
• Programs are distributed worldwide with one container of text in many languages.
• Certain applications require multilingual databases, e.g. translation systems and web-based applications.

The Existing Architecture
• 1.5M lines of C code
• 0.3M lines of 4GL code
• Double-byte enabled
  – CJK, 9 double-byte charsets supported
  – 2-byte only, no 3 or 4-byte
  – No shift-sequenced charsets
  – DBE changes earmarked, easy to find
  – 4 years, 3 developers, 2 QA

Estimated cost of implementing UCS-2 was very high!
• Changing to 16-bit text units affects almost every source module
  – Largest cost is separating char variables based on usage for text or binary data.
  – Use 16-bit null terminators, ignore 8-bit
    “A” 0041, 0000    “Ā” 0100, 0000
  – Pointer arithmetic (advance 2 bytes)
  – Sizing (bytes or characters)
  – New API to use new WIDE TEXT datatype

Product requirements for a multilingual version
• Minimize cost for application migration
• Minimize cost for application upgrade
• Minimize support cost
  – One executable!
• Maintain user-definable character sets
• Add UTF-8 as just another character set
  – UTF-8 algorithms are compatible with other charsets

Scaled down multilingual proposal: UTF-8 implementation
• Implement UTF-8 as 3-byte character set
  – Leverage & extend double-byte enabling
  – Places to change are already earmarked
  – Restrict to composed characters for now
  – Restrict to no surrogates
• Supports all the markets we are in
• UTF-8-enable 4GL and RDBMS first
  – Provides multilingual logic and storage
  – Java+other client technologies coming

Architecture changes: UTF-8-enabling the string library
• N-byte enable character+string functions
  – GetNextChar, GetPreviousChar
  – GetCharacterSize (table-based)
  – Modified IsFirstByte
• New GetColumnLength
• New datatype normalized “BIG” char
• Minor algorithm changes for efficiency
  – Find Character

Architecture changes: UTF-8-enabling character tables
• String libraries use character tables
  – Alphanumeric, Lead-byte, Tail-byte
  – Upper, lower case (700+ characters)
• New property ColumnCount
• New table formats
  – Old architecture presumed 256 byte table
  – Now organized by range lists and trie
• Update table compiler & allow hex entry

Architecture changes: UTF-8-enabling sorting
• How to sort multilingual data?
• Binary sort used for double-byte data
• With UTF-8, Europe is 2-byte, CJK 3-byte
• Solution
  – Binary sort on server
  – Client uses native sort
• Bump key length limit for UTF-8
• Next phase will be enhanced sort

Architecture changes: Character conversion algorithms
• Existing, user-definable, conversions
  – Single-byte character set table maps
  – Double-byte Shift-JIS - EUCJIS algorithm
• New table-driven automated conversions
  – Single-byte to UTF-8, and back
  – Double-byte to UCS-2 and back
  – UTF-8 - UCS-2
  – Trie for speed and memory optimization
• Requires significant QA for data integrity

Architecture changes: Impact on the 4GL user
• 4GL is character set independent
• Almost all functions are character-based
• 3 functions require optional byte-basing
  – Length, Substring, Overlay
  – Options: Byte, Character
• Add new option: Column
• Format (Picture) Phrase
  – “XXXX” has different meaning for UTF-8

Status
• Functioning Well
• Going to second beta
• Implemented with very low cost
• Performance is OK
  – Metrics not yet available
• Testing is most significant cost
  – Reviewing all character set properties
  – Evaluating all conversions

Futures
• For the Progress International Team
  – Multilingual Clients
  – Enhanced Character Folding
  – Enhanced Sorting
• For Progress Customers
  – Deployment of multilingual databases
  – Worldwide access to these databases
  – Worldwide deployment of multi-language applications

Conclusions
• Migration can be achieved in phases
• Migration thru UTF-8 can be low cost
• Double-byte applications can migrate easily to UTF-8
• Asian users can integrate with other languages now
• Non-English users can integrate with Asian languages now
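
Appendix: illustrative sketch of the string-library changes

The slides name the N-byte string functions (GetNextChar, GetPreviousChar, GetCharacterSize, IsFirstByte) and the Byte / Character / Column options for Length, but do not show the C implementation. The following is a minimal Python sketch of how such functions might work, not the Progress code. The snake_case names, the `CHAR_SIZE` table, and the `length` helper are hypothetical. The column rule (two columns for East Asian wide or fullwidth characters, one otherwise) is an assumption standing in for the ColumnCount property. Four-byte sequences are rejected to mirror the 3-byte, no-surrogates restriction. Tail bytes are not validated.

```python
import unicodedata

# Table-based size lookup, indexed by the first byte of a character.
# 0 marks a byte that cannot start a character in this sketch: tail bytes
# (0x80-0xBF), the invalid leads 0xC0/0xC1, and 4-byte leads (0xF0 and up),
# which are excluded to match the 3-byte maximum described in the slides.
CHAR_SIZE = [0] * 256
for b in range(0x00, 0x80):
    CHAR_SIZE[b] = 1
for b in range(0xC2, 0xE0):
    CHAR_SIZE[b] = 2
for b in range(0xE0, 0xF0):
    CHAR_SIZE[b] = 3


def is_first_byte(b: int) -> bool:
    """True if byte value b can start a character."""
    return CHAR_SIZE[b] != 0


def get_character_size(data: bytes, pos: int) -> int:
    """Size in bytes of the character that starts at pos."""
    size = CHAR_SIZE[data[pos]]
    if size == 0:
        raise ValueError(f"byte at offset {pos} is not a supported lead byte")
    return size


def get_next_char(data: bytes, pos: int) -> int:
    """Offset of the character following the one that starts at pos."""
    return pos + get_character_size(data, pos)


def get_previous_char(data: bytes, pos: int) -> int:
    """Offset of the character preceding pos, found by skipping tail bytes."""
    if pos <= 0:
        raise ValueError("no character before offset 0")
    pos -= 1
    while pos > 0 and not is_first_byte(data[pos]):
        pos -= 1
    return pos


def length(data: bytes, unit: str = "character") -> int:
    """Length of UTF-8 data in bytes, characters, or display columns."""
    if unit not in ("byte", "character", "column"):
        raise ValueError(f"unknown unit: {unit}")
    if unit == "byte":
        return len(data)
    count = 0
    pos = 0
    while pos < len(data):
        nxt = get_next_char(data, pos)
        if unit == "character":
            count += 1
        else:
            # Assumed column rule: wide and fullwidth characters take 2 columns.
            ch = data[pos:nxt].decode("utf-8")
            wide = unicodedata.east_asian_width(ch) in ("W", "F")
            count += 2 if wide else 1
        pos = nxt
    return count


sample = "Aé日".encode("utf-8")  # 1-byte, 2-byte, and 3-byte characters
print(length(sample, "byte"), length(sample, "character"), length(sample, "column"))
```

For the sample string the three options give 6 bytes, 3 characters, and 4 columns, which is why a picture such as “XXXX” no longer maps to a fixed byte count. A related property supports the binary sort on the server: comparing well-formed UTF-8 strings byte by byte yields the same order as comparing them by code point.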