HKIUG Unicode Task Force and the EACC to Unicode Migration

7th Annual Hong Kong Innovative Users Group Meeting
11 and 12 December 2006
HKUST Library
HKIUG Unicode Task Force and
the EACC to Unicode Migration
Ki Tat LAM
Head of Library Systems
The Hong Kong University of Science and Technology Library
lblkt@ust.hk
Last revised: 10 December 2006
Contents


HKIUG Unicode Task Force
 CJK/Unicode Resources and the Unicode
Version of TSVCC Table
Migrating INNOPAC’s storage environment from
EACC to Unicode
 MARC-8 and Unicode Environments
 Outstanding Issues
HKIUG Unicode Task Force and Unicode Migration – K.T. Lam, HKUST Library
Observations …
曆 [Calendar]
歷 [History]
历: simplified form of both 曆 and 歷
曆法 / 历法 [System for determining the beginning, length and divisions of a year]
曆法 was incorrectly displayed as 歷法.
Is it a data entry error? A display problem? Or what?
Observation #1:
 Although OCLC WorldCat’s storage environment
has been migrated to Unicode and its
Connexion client is Unicode-based, the work is
not yet finished. There are still problems that
require attention
 How about INNOPAC and its Unicode Storage
Environment? How ready is it for existing
EACC-based sites to migrate to?
U+5386
Export
(in MARC-8)
Export output is {27 46 2A} – incorrect!
Round-trip Crosswalk Failure
Library (EACC) and OCLC WorldCat (Unicode)
Step 1: Import to OCLC
1. Library contributes 历 in EACC {274349}, which is the simplified form of 曆
2. Connexion finds {274349} in mapping table and stores 历 in Unicode U+5386
Step 2: Export from OCLC
3. Connexion finds {274349} and {27462A} in mapping table and decides to output 历 in EACC {27462A}
4. Library receives 历 in EACC {27462A}, which is the simplified form of 歷
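The failure can be reproduced with a toy model of the two mapping tables. The sketch below is illustrative only. The table contents come from the slide, but the names EACC_TO_UNICODE, UNICODE_TO_EACC and round_trip are hypothetical, and the rule used to pick one EACC code on export (the last entry listed wins) is an assumption chosen to mirror the outcome shown, not Connexion's actual logic.

```python
# EACC -> Unicode: two different EACC codes map to the same code point.
EACC_TO_UNICODE = {
    0x274349: 0x5386,  # 历 as the simplified form of 曆
    0x27462A: 0x5386,  # 历 as the simplified form of 歷
}

# Unicode -> EACC: only one EACC code can be kept per code point.
# Assumption: the last entry listed above wins.
UNICODE_TO_EACC = {ucs: eacc for eacc, ucs in EACC_TO_UNICODE.items()}


def round_trip(eacc: int) -> int:
    """Import an EACC code into Unicode storage, then export it again."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc]]


if __name__ == "__main__":
    sent = 0x274349
    received = round_trip(sent)
    print(f"sent {{{sent:06X}}}, received {{{received:06X}}}")
```

Whichever EACC code the reverse table keeps, one of the two inputs cannot survive the round trip, because the distinction is lost as soon as both are stored as U+5386.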
Observation #2:
 The failure of round-trip crosswalk between
systems will continue to be a problem until
everyone interchanges MARC records purely in
Unicode. This will only be achieved when
the majority of systems store and use data
natively in Unicode
 Immediate need for INNOPAC sites to migrate to
Unicode storage environment!
HKIUG Unicode Task Force

In 2003-2004, an ad hoc group of systems
librarians and catalogers from member libraries
worked closely with Innovative Interfaces, Inc.
(III) on issues related to CJK and the EACC to
Unicode mappings.
 Developed HKIUG Version of the EACC to
Unicode mapping table
 Resolved EACC to Unicode multi-mapping
problem
 Began drafting TSVCC (Traditional, Simplified,
Variant Chinese Characters) table
HKIUG Unicode Task Force [2]

In February 2005, the HKIUG Unicode Task Force
was officially established to:
 maintain the CJK/Unicode resources
produced in 2003-2004;
 develop new resources, such as the Unicode
Version of the TSVCC table;
 facilitate the searching, display and retrieval
of CJK records in library catalogs; and
 assist member libraries in migrating from
EACC-based character encoding to Unicode
HKIUG Unicode Task Force [3]

Members of the Task Force:
CHAN Wai Ming (Secretary), University of Hong Kong
HO Yee Ip, Chinese University of Hong Kong
LAM Ki Tat (Chair), The Hong Kong University of Science and Technology
Joanna PONG, City University of Hong Kong
SUN Zehua, The Hong Kong University of Science and Technology
Mr. Philip WONG, City University of Hong Kong
Recruiting new members – we welcome
colleagues to join forces …
HKIUG Unicode Task Force [4]

Achievements in 2006:
 July 2006 - finished and released the Unicode
Version of the TSVCC Table
 August 2006 - released the CJK/Unicode
Resources developed over the past three
years to the Internet for open access
[http://hkiug.ln.edu.hk/unicode/]

November 2006 – visited Hong Kong Shue
Yan College (HKSYC) Library to study its
Unicode Storage Environment; and reported
outstanding issues to III.
TSVCC Table - Unicode Version



When searching 历法 “Li fa”, you will prefer to
retrieve records that have:
 历法
 曆法
where 曆 and 历 have a Traditional – Simplified
relationship
Similarly, when searching 屏, you will prefer to
retrieve its Variant 屛
Requires linking T,S,V forms during searching
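One way to picture T,S,V linking is to reduce every character to a single representative of its link case before comparing. The sketch below is a minimal illustration under that assumption and is not how INNOPAC implements it. The two link cases come from the examples in this presentation, and LINK_CASES, search_key and search are hypothetical names.

```python
# Two link cases taken from the examples in this presentation.
LINK_CASES = [
    "\u66c6\u5386\u66a6\u6b77\u6b74",  # 曆 历 暦 歷 歴
    "\u5c5b\u5c4f\u6452",              # 屛 屏 摒
]

# Map every character to the first character of its link case.
NORMALIZE = {ch: group[0] for group in LINK_CASES for ch in group}


def search_key(text: str) -> str:
    """Replace each linked character with its representative."""
    return "".join(NORMALIZE.get(ch, ch) for ch in text)


def search(query: str, titles: list[str]) -> list[str]:
    """Return the titles whose normalized form contains the normalized query."""
    key = search_key(query)
    return [title for title in titles if key in search_key(title)]


if __name__ == "__main__":
    catalog = ["\u5386\u6cd5", "\u66c6\u6cd5", "\u6b77\u53f2"]  # 历法, 曆法, 歷史
    print(search("\u5386\u6cd5", catalog))
```

Because 曆 and 歷 share one link case, a search for 曆 alone would also retrieve 歷史, so the gain in recall comes with some loss of precision.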
TSVCC Table - Unicode Version [2]

Results of implementing TSVCC Linking:
 Improvement in searching – higher recall
 Trade-off – lower precision
 If search results are sorted/displayed in
TSVCC normalized form, misleading and
inaccurate display may occur - such as the
OCLC Connexion browse list display problem
mentioned previously
TSVCC Table - Unicode Version [3]

HKIUG Unicode Task Force constructed two
versions of TSVCC tables
 EACC Version [1.0 released August 2005]
 Unicode Version [1.0 released July 2006]
for INNOPAC systems that store characters in
EACC and in Unicode respectively
                     EACC Version    Unicode Version
No. of link cases    3145            3447
No. of characters    7190            7962
TSVCC Table - Unicode Version [4]

TSVCC link cases collected in the Unicode
Version are:
 derived from the EACC Version, e.g.
EACC link, U+XXXX multi-mapped;
 harvested from Unicode Consortium’s Unihan
Database, e.g.
kSimplifiedVariant, kZVariant;
 proposed by the Unicode Task Force
members, e.g.
hkiugSimplifiedVariant, hkiugZVariant
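Harvesting variant links from the Unihan Database could look roughly like the sketch below. It assumes the tab-separated layout of the Unihan variants file (code point, field name, value) and works on a few illustrative lines held in memory rather than the real file. The function name harvest is hypothetical, and the Task Force's own tooling is not described in this presentation.

```python
import re

# Illustrative lines in the tab-separated layout of the Unihan variants file.
SAMPLE = (
    "# comment lines start with '#'\n"
    "U+66C6\tkSimplifiedVariant\tU+5386\n"
    "U+F98B\tkZVariant\tU+66C6\n"
    "U+F98C\tkZVariant\tU+6B77\n"
)

CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")


def harvest(lines, fields=("kSimplifiedVariant", "kZVariant")):
    """Return (source, field, target) triples for the requested fields."""
    links = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, field, value = line.split("\t", 2)
        if field not in fields:
            continue
        src = int(source[2:], 16)
        # A value may list several code points; any other text is ignored.
        for match in CODE_POINT.finditer(value):
            links.append((src, field, int(match.group(1), 16)))
    return links


if __name__ == "__main__":
    for src, field, tgt in harvest(SAMPLE.splitlines()):
        print(f"{field} of U+{src:04X} is U+{tgt:04X}")
```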
TSVCC Table - Unicode Version [5]

Examples of Link Cases in Unicode Version:
U+66C6 曆 | U+5386 历 | U+66A6 暦 | U+6B77 歷 | U+6B74 歴 | U+F98B 曆 | U+F98C 歷 | #EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 multi-mapped 27462A,274349 AND kZVariant of U+F98B is U+66C6 AND kZVariant of U+F98C is U+6B77
U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B
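A link case written in this layout can be split into its member code points and its trailing justification with a few lines of code. The sketch assumes the pipe-separated form shown on the slide, with the comment introduced by "#". The released table's actual file format may differ, and parse_link_case is a hypothetical helper.

```python
import re

CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")


def parse_link_case(line: str):
    """Split a link case into (member code points, comment text)."""
    # Everything after the first '#' is the justification comment.
    body, _, comment = line.partition("#")
    members = [int(m.group(1), 16) for m in CODE_POINT.finditer(body)]
    return members, comment.strip()


EXAMPLE = (
    "U+5C5B \u5c5b | U+5C4F \u5c4f | U+6452 \u6452 | "
    "#EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B"
)

if __name__ == "__main__":
    members, comment = parse_link_case(EXAMPLE)
    print([f"U+{cp:04X}" for cp in members])
    print(comment)
```

Code points mentioned inside the comment are not treated as members, because only the text before "#" is scanned.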
TSVCC Table - Unicode Version [6]


Support linking of CJK Compatibility Ideographs
 e.g. [U+F92F 勞] in the previous screen dump,
a variant from KS C5601-1987
Support linking of forms used differently in
Mainland China and in Hong Kong, for example:
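On the CJK Compatibility Ideographs mentioned in the first point: the Unicode Character Database gives each of them a canonical mapping to a unified ideograph, which Python exposes through the standard unicodedata module. The sketch below is shown for comparison only. Unicode normalization is a different mechanism from the explicit links in the TSVCC table, and unify is a hypothetical helper name.

```python
import unicodedata


def unify(ch: str) -> str:
    """Map a compatibility ideograph to its unified form via NFC."""
    return unicodedata.normalize("NFC", ch)


if __name__ == "__main__":
    for cp in (0xF92F, 0xF98B, 0xF98C):
        unified = unify(chr(cp))
        print(f"U+{cp:04X} -> U+{ord(unified):04X} {unified}")
```

A search layer that normalizes text on input would fold these forms automatically, but it would not cover traditional, simplified, or other variant relationships, which still need a table such as TSVCC.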
TSVCC Table - Unicode Version [7]

We welcome contribution from CJK experts and
colleagues of member libraries to enhance the
TSVCC tables
 e.g. projects to establish TSVCC links from
Hangul Syllables, Hiragana and Katakana to
CJK ideographs
MARC-8 and Unicode
Environments

In 2000, the Library of Congress issued:
Specifications to distinguish the encoding of
MARC 21 records in the original (MARC-8)
environment and in the new UCS/Unicode
environment
[http://www.loc.gov/marc/specifications/speccharintro.html]

MARC-8 means characters are encoded in one
8-bit byte (e.g. ASCII) or three 8-bit bytes (e.g.
EACC)
21 62 62 = 黃
21 39 25 = 大
21 30 21 = 一
A MARC 21 bibliographic record in
ISO2709 format viewed in Notepad,
showing CJK characters encoded in
EACC in MARC-8 environment
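The three-bytes-per-character layout can be checked by hand on the byte values above. The sketch below splits a hex dump into EACC codes and looks them up in a tiny table containing only the three characters from this slide. It is a simplified illustration: split_eacc and decode_eacc are hypothetical names, and the code ignores the escape sequences that switch character sets in a real MARC-8 record.

```python
# EACC code -> character, limited to the three characters on this slide.
EACC_TO_CHAR = {
    "216262": "\u9ec3",  # 黃
    "213925": "\u5927",  # 大
    "213021": "\u4e00",  # 一
}


def split_eacc(hex_dump: str) -> list[str]:
    """Split a hex dump of EACC data into three-byte codes."""
    data = bytes.fromhex(hex_dump)
    if len(data) % 3 != 0:
        raise ValueError("EACC data must be a multiple of three bytes")
    return [data[i:i + 3].hex().upper() for i in range(0, len(data), 3)]


def decode_eacc(hex_dump: str) -> str:
    """Look up each three-byte code in the sample table."""
    return "".join(EACC_TO_CHAR[code] for code in split_eacc(hex_dump))


if __name__ == "__main__":
    dump = "21 62 62 21 39 25 21 30 21"
    print(split_eacc(dump))
    print(decode_eacc(dump))
```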
MARC-8 and Unicode
Environments [2]

UCS/Unicode Environment
[http://www.loc.gov/marc/specifications/speccharucs.html]





Use UTF-8 as character encoding
Leader position 9 contains value “a”
Field 066 (Character Sets Present) is not
needed
The script identification information in subfield
6 (Linkage) can be dropped
Lengths specified by number of 8-bit bytes,
rather than number of characters.
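Two of these points are easy to verify in code: the leader flag and the difference between character count and byte count. The sketch below is a minimal illustration. The sample leaders are made up for the example, and uses_unicode_environment is a hypothetical helper, not part of any MARC library.

```python
def uses_unicode_environment(leader: str) -> bool:
    """Leader position 9 (zero-based) holds 'a' for UCS/Unicode records."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    return leader[9] == "a"


# Hypothetical leaders: position 9 is 'a' (Unicode) or blank (MARC-8).
UNICODE_LEADER = "00714cam a2200205 a 4500"
MARC8_LEADER = "00714cam  2200205 a 4500"

if __name__ == "__main__":
    print(uses_unicode_environment(UNICODE_LEADER))
    print(uses_unicode_environment(MARC8_LEADER))

    title = "\u9ec3\u5927\u4e00"  # 黃大一
    # Lengths in the record count UTF-8 bytes, not characters.
    print(len(title), len(title.encode("utf-8")))
```

Each of these CJK characters takes three bytes in UTF-8, so a three-character string contributes nine bytes to the field and record lengths.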
MARC-8 and Unicode
Environments [3]

Unicode combining rule for diacritics, i.e.
combining marks follow rather than precede
the character they modify
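The ordering rule matters during conversion: if combining marks are left in front of their base characters, they attach to the wrong letter or fail to compose. The sketch below moves each run of combining marks to follow the next base character. It assumes the MARC-8 bytes have already been mapped one-for-one to Unicode code points, and marks_after_base is a hypothetical helper name.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move combining marks that precede a base character to follow it."""
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending = []
    out.extend(pending)  # marks with no following base are kept at the end
    return "".join(out)


if __name__ == "__main__":
    marc8_order = "caf\u0301e"  # combining acute placed before the 'e'
    fixed = marks_after_base(marc8_order)
    print(unicodedata.normalize("NFC", fixed))
```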
A MARC 21 bibliographic record in
ISO2709 format viewed in Notepad,
showing CJK characters encoded in
UTF-8 in UCS/Unicode environment
Migrating from EACC to Unicode

The following INNOPAC systems are in Unicode
Storage Environment:
 HKSYC (Hong Kong Shue Yan College)
 HKALL (the INN-Reach system for the eight
universities in Hong Kong)
 HKUST Tool Testing Database
Migrating from EACC to Unicode [2]

HKSYC Visit
 A group of systems librarians and catalogers
from member libraries visited HKSYC Library
in November 2006 to learn how its INNOPAC
system works in Unicode Storage
Environment
 A number of outstanding issues were
identified and/or confirmed
 If you have migrated to Unicode storage or
plan to migrate now, you might also face the
same problems
Migrating from EACC to Unicode [3]

Outstanding Issues
 TSVCC Linking not turned on; and even if
turned on, it would not be using the latest
HKIUG version
 When certain CJK characters, such as U+8AAC 説
and U+7CB5 粵, are entered via Millennium
Editor and the record is saved, these characters
are stripped away and not saved - a destructive
bug awaiting a fix
Migrating from EACC to Unicode [4]

Export from INNOPAC - only export in MARC-8
Environment was provided. There should be an
option for users to export in Unicode
Environment
• III replied that this option is available

Import (Load) into INNOPAC - only import in
MARC-8 Environment was provided. There
should be an option for users to load MARC
records in Unicode Environment (i.e. in UTF-8).
• III replied that this option is available
Migrating from EACC to Unicode [5]

It seemed that sorting at HKSYC is still
EACC-based
• Sorting key seemed to be constructed from:
[No. of strokes][EACC code value]
• For example, as observed from WebPAC’s URL,
sorting key for 中國 is:
“04{213034}11{21376f}”.
It should instead be sorted by Unicode code value,
i.e.
“04{u4e2d}11{u570b}”
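A key of the proposed form can be built directly from stroke counts and code points. The sketch below is only a reconstruction of the pattern inferred from the URL above and not INNOPAC's actual routine. STROKES holds just the two characters from the example, and unicode_sort_key is a hypothetical name.

```python
# Stroke counts for the two characters in the example (from the key above).
STROKES = {
    "\u4e2d": 4,   # 中
    "\u570b": 11,  # 國
}


def unicode_sort_key(text: str) -> str:
    """Build [No. of strokes]{u + Unicode code value} for each character."""
    return "".join(
        f"{STROKES[ch]:02d}{{u{ord(ch):04x}}}" for ch in text
    )


if __name__ == "__main__":
    print(unicode_sort_key("\u4e2d\u570b"))
```

A key built this way depends only on stroke counts and Unicode code values, so no EACC lookup is needed at sort time.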
Migrating from EACC to Unicode [6]

The illogical sorting order found in HKUST’s
Tool Testing Database also needs to be fixed:
1: ASCII space/punctuation (e.g. :)
2: ASCII numerals (e.g. 1)
3: CJK characters with pinyin (e.g. 中)
4: ASCII letters (e.g. a)
5: CJK characters without pinyin (e.g. を)
Migrating from EACC to Unicode [7]

Pure Unicode Storage Environment
• Once migrated to Unicode Storage Environment,
there should be no need to map back and
forth between EACC and Unicode, except in
some necessary conversion routines
• In order to maintain a natively Unicode
environment, EACC dependencies should be
identified and eliminated
Conclusion

How far are we towards native Unicode?
 Both LC and OCLC have done enormous
work in enabling and promoting the use of
Unicode in MARC records
 ILS vendors including III are working very
hard to implement and enhance the Unicode
support
 Libraries and CJK experts are providing
advice and suggesting solutions
Conclusion [2]

Migrating INNOPAC to Unicode
 We have reviewed various outstanding issues
as found in INNOPAC’s Unicode Storage
Environment
 We hope these issues will be resolved quickly
so that HKIUG member libraries can start to
migrate their systems to Unicode
 HKIUG Unicode Task Force will continue to
work closely with III to enable a smooth
migration
Additional Readings

K.T. Lam. EACC to Unicode migration. OCLC
CJK Users Group 2006 Annual Meeting.
[http://hdl.handle.net/1783.1/2500]

Wong, Philip and K.T. Lam. HKIUG’s Unicode
projects : untangling the chaotic codes. HKIUG
Annual Meeting 2005.
[http://hdl.handle.net/1783.1/2429]
Thank You!