Overview of Erudition Databases

advertisement
Erudition 愛如生 Databases
Yang Jidong
Univ. of Michigan
Databases of pre-modern Chinese texts:
an overview

Scripta Sinica (漢籍電子文獻), Academia Sinica,
Taipei. PROS: Earliest full-text database of premodern Chinese works. Based on modern
punctuated and annotated versions. High accuracy.
Acceptable price ($2,500/yr). CONS: Early
contents based on Big5 encoding, many missing
characters. Uneven qualities of digitized texts.
Lack of strong financial and technological support
for future development.
Databases of pre-modern Chinese texts:
an overview (cont’d)

Guoxue baodian (國學寶典), Guoxue Company,
Beijing. PROS: Simplified characters (could be a
con to some users). Most texts punctuated.
Available in various media (Web, flash drive, CDROM, etc.). CONS: Relatively low accuracy. Early
contents based on GB encoding with many missing
characters. Some contents may have copy-right
issues (challenged by various Chinese publishers
in court).
Databases of pre-modern Chinese texts:
an overview (cont’d)

Unihan databases (書同文數據庫), including 四
部備要, 四部叢刊, 石刻文獻, 明清實錄、會典, 中
醫古文獻, etc., Unihan Company, Beijing. PROS:
High accuracy. Using Unicode from the beginning.
Reliable and fast access. Relatively low prices
(one-time charges only for most databases, no
annual fee). CONS: Except for the 四部備要 and 四
部叢刊, most databases target researchers with
specific interests such as those studying stone
inscriptions and history of Chinese medicine.
Databases of pre-modern Chinese texts:
an overview (cont’d)

Wenyuan Ge Siku quanshu (文淵閣庫全書),
digitized by Unihan Company and marketed by
Dizhi in Hong Kong. PROS: A comprehensive
collection of pre-modern works in the “four
classes.” CONS: Same as Siku quanshu, a “purified”
and “politically correct” library of the Qing
Dynasty.
Erudition databases: background



Liu Junwen 劉俊文, CEO and Chief Editor of
Erudition. Well-known scholar in Tang history.
PKU history professor. Authority in the study of
medieval Chinese political and legal systems.
Compiler of 《日本學者論中國古代史》.
Investment from a private source.
Most other Chinese database companies received
heavy financial support from the government, but
Erudition “did not get a penny” so far.
Erudition databases: content


Main product: 中國基本古籍庫, containing 10,000
titles. All digitized from print editions “carefully
selected” by experienced scholars and believed by
many experts to be the best available. Very high
accuracy. Because of its high quality, the database
is cited by many recent studies on pre-modern
China as a primary source (just like the 中華書局
edition of the 24 dynastic histories, and the 中國古
典文學叢書 from 上海古籍出版社).
Other Erudition databases: 俗文學,方志, 申報, 敦煌文
書, etc.
Erudition databases: technology


Based on Unicode. Over the past decade, the
number of Unicode unified CJK ideographs has
grown to more than 70,000 (by comparison, the 漢
語大詞典 has more than 50,000 characters).
OCR (followed by manual proofreading) was used
in the production of the 中國基本古籍庫. But it was
abandoned by Erudition soon after that. All new
databases are made by human inputting (using
stroke-based—not romanization-based—input
methods).
Erudition databases: marketing



The Erudition database (esp. the 中国基本古籍库)
sells pretty well in China, Hong Kong, Taiwan, and
Japan. In the West, Berlin State Library and Library
of Congress are their customers.
Harvard, Columbia, Princeton, Yale, Chicago,
Michigan, and Northwestern have recently formed
a consortium to purchase the 方志 and 俗文學
databases from Erudition.
Erudition databases are the most expensive among
Chinese databases. Consortium purchase is the
only solution for North American institutions.
Download