ErrCorrection-ICUDL - Universal Digital Library

Error Detection and
Correction in Metadata
Nilu Prahallad, Zhenkun Zhou,
Ting Zhang and Vamshi Ambati
Carnegie Mellon University, USA and Zhejiang
University, China
1
Agenda

Typical errors in Metadata
  Title
  Language
  Subject
  Other fields
Correction Strategies
Future Research directions
  Learning from Example
2
Universal Digital Library

Large scale digital collections and archive - first of its kind
  1.46 Million Books
  21 different languages
Large scale distributed collaboration - first of its kind
  Four countries - USA, China, Egypt, India
  35 scanning locations
  3000 people (or more…)

3
What has kept us busy for the last year?

We reached 1 M books at our last meeting in Egypt
  Aggregating and cleaning the metadata took us one complete year
Metadata is the most important component in a Library, more so in a Digital Library
  Humans work in strange ways that computers don’t YET
4
What is metadata?

Information to identify a book
  Title, Author, Year, Language, Subject, Publisher, Copyright
Dublin Core standard
Structural metadata - METS standard

5
Why do we have problems in
Metadata?

Cataloguing in libraries by professionals is accurate but expensive
  $100 per book?
At ULIB we want to get things done on a large scale but economically
  We are limited not by our vision, but by our funds
To Err is Human
6
Nature of the Problems

Data Entry problems
  Genuine confusion
  Careless entry
Data Normalization
  Multiple languages and Standards
  Although not a problem in itself, normalization is absolutely necessary for multilingual access
7
What are the solutions on the table?

Manual effort
  Reliable but expensive and time-consuming
Original born-digital metadata records
  Not all books have them; coordinating to get these is time-consuming
Completely Automatic, Unsupervised
  Not reliable; more good than harm?
Semi-supervised techniques
  Manual 20%, Automatic 80%
  We think we know how to work in such a scenario
8
Going Semi-Automatic
Computers are really good at Anomaly Detection
We identify and automatically correct the records we are most confident about, and set all doubtful cases aside for manual inspection
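
The split between automatic correction and manual inspection can be sketched as a simple confidence threshold. This is only an illustration: the `Suggestion` record, the confidence score in [0, 1] and the 0.9 threshold are assumptions, not details given in the talk.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    """A proposed metadata fix (hypothetical record layout)."""
    book_id: str
    field_name: str
    old_value: str
    new_value: str
    confidence: float  # assumed score in [0, 1] produced by some detector

def triage(suggestions, threshold=0.9):
    """Split suggestions into auto-applied fixes and a manual review queue."""
    auto, manual = [], []
    for s in suggestions:
        if s.confidence >= threshold:
            auto.append(s)      # confident enough to correct automatically
        else:
            manual.append(s)    # doubtful case, left for a human
    return auto, manual
```

Raising the threshold sends more records to the manual queue; lowering it trades review effort for a higher risk of wrong automatic corrections.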

9
Language Identification
Problems and Solutions
Work done by Nilu Prahallad
10
Scale of the Problem
1.46 million books in the digital library
0.4 million books were tagged with the wrong language or with no language at all

11
Problems in Language
Blank Language field
 Wrong Language assigned
 Non-standard conventions
 Multilanguage confusion

12
Blank Language Field
This is a French book. The data entry operator may not have known the language,
and so probably tagged it as unknown
13
Wrong language assignment

Data entry errors (copy/paste errors)
  A bulk of books is given a random language
Lack of language knowledge
  Not all data entry operators can identify or speak all the languages that we intend to digitize

14
Wrong language assignment
The above is a Chinese book that talks about Japanese ethics
The title mentions Japanese, which led the operator to tag it as
a Japanese book instead of a Chinese one
15
Non-Standard Conventions
Different data entry conventions
  Ex: English, ENGLISH, en, eng
Typographic errors by the data entry operators
  Ex: ENGLIS, ENGL etc.
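
One way to fold such variants into a single label is sketched below. The canonical list, the alias table and the cutoff are illustrative assumptions; a real deployment would need the full set of 21 languages and a reviewed alias list.

```python
import difflib

# Illustrative values only; not the actual ULIB language list.
CANONICAL = ["English", "French", "Chinese", "Hindi"]
ALIASES = {"en": "English", "eng": "English", "fr": "French", "zh": "Chinese"}

def normalize_language(raw, cutoff=0.6):
    """Map a raw Language value to a canonical name, or None if unsure."""
    key = raw.strip().lower()
    if not key:
        return None
    if key in ALIASES:
        return ALIASES[key]
    by_lower = {name.lower(): name for name in CANONICAL}
    if key in by_lower:
        return by_lower[key]
    # Truncated entries such as "ENGL" or "ENGLIS": accept an unambiguous prefix.
    if len(key) >= 4:
        hits = [name for low, name in by_lower.items() if low.startswith(key)]
        if len(hits) == 1:
            return hits[0]
    # Other typos: fall back to the closest canonical name, if close enough.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None
```

Values that come back as `None` are the doubtful cases that would go to manual inspection.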
16
Multilanguage confusion
This is a Chinese book that talks about the techniques of
reading and its approaches. The Language field is wrongly
tagged as English; it should be Chinese.
17
Impact on ULIB
Due to the errors mentioned in the previous slides, the goal of the digital library is hindered
Accurate and complete access to online books is not available even though the books are available on the servers

18
Solutions
Automatic detection of the Language
Method:
The language is detected automatically using language models

The steps involved in building the models are:
1. Obtain the unique letter trigrams in each document.
2. Compute TF-IDF weights for each term.

To identify the language of a given title, the steps are:
1. Obtain terms from the query title.
2. Compute the cosine correlation between the query title and every document.
3. Find the document that produces the maximum correlation with the query title.
4. The language of the query title is the same as the language of the
document producing the maximum correlation.
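
The steps above can be sketched as follows. This is a minimal illustration, not the project's actual program: it assumes one sample text per language stands in for a "document", and the smoothed IDF formula is one common variant chosen here because the slides do not specify the weighting.

```python
import math
from collections import Counter

def trigrams(text):
    """Overlapping 3-character sequences of the lowercased, space-normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + 3] for i in range(len(text) - 2)]

def build_models(samples):
    """samples: {language: sample text}. Returns (TF-IDF vector per language, IDF table)."""
    counts = {lang: Counter(trigrams(text)) for lang, text in samples.items()}
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    # Smoothed IDF (an assumed variant) so trigrams shared by all languages keep some weight.
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    vectors = {lang: {g: tf * idf[g] for g, tf in c.items()} for lang, c in counts.items()}
    return vectors, idf

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify(title, vectors, idf):
    """Return (best language, cosine score) for a query title."""
    query = {g: tf * idf[g] for g, tf in Counter(trigrams(title)).items() if g in idf}
    scores = {lang: cosine(query, vec) for lang, vec in vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

The returned score can double as a confidence value: a low maximum, or a small gap between the top two languages, marks a title for manual inspection.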
19
Solutions

Advantages:
  The program can detect the language the book actually belongs to
  even when multiple languages are mentioned in the title.
  Even when the language is tagged as unknown, we can find
  the language of the book programmatically.
  We can correct errors in the language field using the
  language model and MMR (maximal marginal relevance),
  by taking the correlation between the title and the
  tagged language and finding the least likely
  occurrences in that language.

Disadvantages:
  This procedure is not 100% accurate, but it gives the
  desired result in most cases.
20
Subject Categorization
Problems and Solutions
Ting Zhang
21
General Information
Total Chinese and English books: 1,027,840
Total number of distinct subject combinations: 210,439

22
Need for Subject Categories

Subject navigation

Narrow down the range of
search
23
Problems with Subject
Wrong Categorization
 Blank Subject field
 Non-English subject field
 Mixed Language subject field
 Very-detailed subject field

24
Wrong categorization

A History book got classified under
Geography
25
Blank Subject

Almost 300K books have “NULL” subject
information
26
Non-English subject
An English language book tagged with a Chinese subject
A Chinese language book tagged with a Chinese subject might be OK, but would create issues for multilingual search and access
Mixed language subject

27
Non-English subject
Chinese book with Chinese subject
28
Mixed Language Subjects
Subject of this book is described in a mixture
of English and Chinese
29
Very detailed subjects

Almost every book is tagged with a
distinct variation of the Subject
30
What needs to be done?
Standardize the set of subjects, such as art, biology, medicine, physics
etc. We have defined 29 such standard subjects and mapped every
sub-subject to one main subject, so that most of the books fit into
these 29 subjects.
All 29 categories are based on the CLC (Chinese Library
Classification); see Appendix 1.
31
Solution: Semi-Automatic
A librarian manually categorizes one book into a particular category
A programmer writes a program that identifies all titles in the ULIB collection whose title words overlap with it, and attaches the same subject tag
Continue the process for at least 20% of the books; the remaining 80% get corrected automatically
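
A minimal sketch of the title-overlap step, under stated assumptions: overlap is measured here as Jaccard similarity of title words, and the stopword list and 0.5 threshold are illustrative choices, not values from the talk. Splitting on word characters only works for space-delimited languages; Chinese titles would first need segmentation.

```python
import re

STOPWORDS = frozenset({"the", "of", "a", "an", "and", "in", "on", "for"})  # illustrative

def title_words(title):
    """Lowercased content words of a title."""
    return {w for w in re.findall(r"\w+", title.lower()) if w not in STOPWORDS}

def propagate_subjects(labelled, unlabelled, min_overlap=0.5):
    """labelled: {title: subject} set by a librarian; unlabelled: iterable of titles.
    Returns {title: subject} for titles whose best word overlap reaches min_overlap."""
    seeds = [(title_words(t), subject) for t, subject in labelled.items()]
    assigned = {}
    for title in unlabelled:
        words = title_words(title)
        best_score, best_subject = 0.0, None
        for seed_words, subject in seeds:
            union = words | seed_words
            score = len(words & seed_words) / len(union) if union else 0.0
            if score > best_score:
                best_score, best_subject = score, subject
        if best_subject is not None and best_score >= min_overlap:
            assigned[title] = best_subject
    return assigned
```

Titles that receive no tag stay in the pool for the next round of manual categorization.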

32
Our Progress with the solution:
More than 600K Chinese books have been assigned a main subject
category.
[Chart: number of subjects and number of books by subject frequency; frequency bins 1, 2 - 10, 10 - 100, 100 - 300, >=300]
33
TITLE Correction
Problems and Solutions
Zhenkun Zhou
34
‘Title’ Statistics


There are more than 1,466,000 books
More than 1 million titles are not in English, but in 20 other languages
35
Issues with TITLE field
Illegal characters
 Incomplete and incorrect titles
 Varying Character-sets
 Spelling Variations (old / new variations)
 Segmentation and Tokenization
 Non-native language titles

36
Illegal characters
Mostly punctuation marks
Example: " Watch Out for the Foreign Guests! "
37
Incomplete Titles

Incomplete or partial titles
Example: there are about 37 books with the same title, “Annual report”
Their full titles should be something like
“Hong Kong Immigration Department Annual Report of the
Year 2000-2001”
38
Varying character sets
Titles in different character sets
  GBK, UTF8, ASCII
39
Varying spelling style
Example
  明實錄:明太宗實錄 (traditional Chinese)
  明实录:明太宗实录 (simplified Chinese)
The same is true of old and new Arabic
40
Segmentation and Tokenization

Not a problem, but an issue
Most languages mark word-level segmentation with a space, “ “, which helps text processing
For Chinese, segmentation is not easy to handle, which prevents word-level search on titles
41
Non-native language titles
Standard transliteration notation for enabling cross-language search
  Ex. “齐白石” -> “Qi Bai Shi”
Displaying the transliteration and the equivalent translation of a title would enable us to know what the book is about

42
Solutions
For titles with punctuation mistakes or incomplete titles
  Use parsing tools to correct them
  Ex. Perl
  Advantage: regular expressions can handle different situations
  Disadvantage: cannot anticipate all situations, and is sometimes imprecise
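
A small sketch of the regular-expression approach, written in Python rather than the Perl mentioned above. Which characters count as illegal is an assumption here: only stray quotation marks and surplus whitespace around the title are stripped, and punctuation inside the title is left alone.

```python
import re

# Characters treated as noise at the edges of a title (an assumed set).
EDGE = r"""[\s"'“”‘’]+"""
EDGE_RE = re.compile(rf"^{EDGE}|{EDGE}$")

def clean_title(title):
    """Strip stray quotes and spaces around a title and collapse inner whitespace."""
    cleaned = EDGE_RE.sub("", title)
    return re.sub(r"\s+", " ", cleaned)
```

Applied to the earlier example, `clean_title('" Watch Out for the Foreign Guests! "')` yields the title without the surrounding quotes and spaces. Incomplete titles such as “Annual report” cannot be repaired this way, since the missing words are not in the record.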
43
Solutions
For titles in different character sets
  Convert all book titles to a single UTF encoding,
  Ex. UTF8 characters.
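
A minimal sketch of such a conversion, assuming each title arrives as raw bytes in an unknown encoding that is one of UTF-8 (which covers ASCII) or GBK. Trying encodings in order is a heuristic: a byte string can occasionally be valid in more than one encoding, so the detected source encoding is returned for auditing.

```python
def to_utf8(raw, candidates=("utf-8", "gbk")):
    """Decode raw title bytes with the first encoding that fits, then re-encode as UTF-8.
    Returns (utf8_bytes, source_encoding)."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return text.encode("utf-8"), encoding
    raise ValueError("title is not valid in any candidate encoding")
```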
44
Solutions

For titles in different spelling styles

Convert the different titles of the same book into one style
  Ex. “中国”, “中國” -> “中国”
  Advantage: offline, easy
  Disadvantage: poor extensibility, conceptually incorrect
Transform titles between styles
  Ex. “中国” -> “中國”
      “中國” -> “中国”
  Advantage: online, good extensibility
  Disadvantage: needs processing time
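
Both options rest on a character mapping between the two styles. The sketch below uses a three-character table built from the examples in these slides; it is purely illustrative, and a real system would need a complete conversion table. Because several traditional characters can map to one simplified character, the reverse direction is not a simple inversion of this table.

```python
# Tiny illustrative table; covers only characters that appear in the slide examples.
TRAD_TO_SIMP = str.maketrans({"國": "国", "實": "实", "錄": "录"})

def to_simplified(title):
    """Map traditional characters to simplified ones; other characters pass through."""
    return title.translate(TRAD_TO_SIMP)

def same_title(a, b):
    """Compare two titles regardless of spelling style (query-time use)."""
    return to_simplified(a) == to_simplified(b)
```

Running `to_simplified` once over the stored titles corresponds to the offline option; calling `same_title` while searching corresponds to the online one.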
45
Solutions

Title translation and transliteration
Translate titles from different languages
  Ex. “中国历史” -> “Chinese History”
Automatic Translation (Zhejiang Univ) and
Transliteration module (open source tool)
46
Future Research Directions
Vamshi Ambati
47
Subject Categorization

Text Categorization
  Requires a large amount of text
  At ULIB, not all languages have an OCR
Can we do well with sparse data?
  Semantics of words using WordNet
Can we use contextual information?
  Ex: Jane Austen, Charles Dickens - Literature
  Ex: Swami Prabudha - Religion
48
Language Identification

Our ‘byte frequency’ based language identification approach has a lot of problems when the languages are close
  Ex: Hindi, Sanskrit
Can we use larger context?
  Longer character sequences
  Function words - ’of’, ’the’ (English)
  Dictionaries
Language Identification from Images
49
Agents that learn by Example

OCLC arguably has the most accurate data we have seen so far
  Can we programmatically access it, compare it with our existing data and correct ours?
Some of the information regarding books is available in multiple catalogues all over the web (including Wikipedia)
  Can we benefit from this?
50
Language Translation

Good Enough Translation for Titles and Subjects
  Universal Dictionary of All Languages (Dr. Shamos) could be a starting point
  Google Translation Systems could help
  A system at Xiamen University in China has already helped us do the translation
  We at CMU and IISc will address most of the other languages

51
Thank you
Suggestions/Questions?
52
Appendix
Subjects list
53
Appendix 1
Catalogue: Content
Agriculture: agricultural engineering, agronomy, gardening, forestry, herding, veterinary science, hunting, silkworms, bees, aquatic products, fishery etc.
Architecture: art of building, architectural science (including architectural exploration, architectural design, architectural structure, soil mechanics, building foundations, building materials, construction technology, building equipment, regional planning, town planning, public works)
Art: painting, calligraphy, seal cutting, photographic art, industrial art, music, dance, drama, cinema, television art
54
Appendix 1
Astronomy: astronomy
Biography: biography
Biology: general biology, cytology, genetics, biochemistry, biophysics, molecular biology, bioengineering, environmental biology, paleontology, microbiology, botany, zoology, entomology, anthropology
Chemistry: inorganic chemistry, organic chemistry, macromolecular (polymer) chemistry, physical chemistry, theoretical chemistry, analytical chemistry, applied chemistry
Computer Science: automation, computing techniques
Economics: political economics, economic profile, economic history, economic geography, economic planning, economic management, agricultural economy, industrial economy, traffic and transport economy, trade, marketing
55
Appendix 1
Education: education, education at all levels, all forms of education, information & knowledge dissemination, cultural activities
Engineering: general industrial technology, mineral engineering, petroleum and natural gas industry, metallurgical industry, metallography & smithcraft, machinery & instruments, weapons industry, energy industry, atomic energy technology, electrical engineering, radio electronics & telegraphy, chemical industry, light industry & handicraft, hydraulic engineering, transportation, aviation & space flight
Environmental Science: environmental science
Geography: human geography, physical geography, geophysics, topography, meteorology, geology, oceanography
56
Appendix 1
History: archaeology, folkways
Language: linguistics, minority languages, foreign languages, all kinds of language systems
Literature: literary theory, Chinese literature etc.
Mathematics: mathematics
Medicine: basic medicine, clinical medicine, preventive medicine, hygiene, pharmacy etc.
Military: strategy, tactics, military campaigns, military technology, military geography etc.
Natural Science: system theory, methodology etc.
57
Appendix 1
Philosophy: logic, ethics, aesthetics etc.
Physics: dynamics, physics etc.
Poetry: poetry
Politics & Law: diplomacy, political relations, law etc.
Psychology: psychology
Religion: religion, divination, superstition etc.
Social Science: management theory, statistics, sociology, demography, personnel science etc.
General: encyclopedias, dictionaries, book catalogues, abstracts & indexes etc.
Miscellaneous: not included in the above categories
58
For one million books..
Out of 1 million books, the wrongly tagged ones are mainly Chinese and English books
59