Error Detection and Correction in Metadata
Nilu Prahallad, Zhenkun Zhou, Ting Zhang and Vamshi Ambati
Carnegie Mellon University, USA and Zhejiang University, China

Agenda
- Typical errors in Metadata
  - Title
  - Language
  - Subject
  - Other fields
- Correction Strategies
- Future Research directions
- Learning from Example

Universal Digital Library
- Large-scale digital collection and archive, the first of its kind
  - 1.46 million books
  - 21 different languages
- Large-scale distributed collaboration, the first of its kind
  - Four countries: USA, China, Egypt, India
  - 35 scanning locations
  - 3000 people (or more…)

What has kept us busy for the last year?
- We reached 1 million books at our last meeting in Egypt
- Aggregating and cleaning the metadata took us one complete year
- Metadata is the most important component in a library, more so in a digital library
- Humans work in strange ways that computers don’t YET

What is metadata?
- Information to identify a book
  - Title, Author, Year, Language, Subject, Publisher, Copyright
  - Dublin Core standard
- Structural metadata
  - METS standard

Why do we have problems in Metadata?
- Cataloguing in libraries by professionals is accurate but expensive
  - $100 per book?
- At ULIB we want to get things done on a large scale, but economically
  - We are not limited by our vision, but by our funds
- To err is human

Nature of the Problems
- Data entry problems
  - Genuine confusion
  - Careless entry
- Data normalization
  - Multiple languages and standards
  - Although not a problem in itself, absolutely necessary for multilingual access

What are the solutions on the table?
- Manual effort
  - Reliable, but expensive and time-consuming
- Original born-digital metadata records
  - Not all books have them, and coordinating to get these is time-consuming
- Completely automatic, unsupervised
  - Not reliable; more good than harm?
- Semi-supervised techniques
  - Manual 20%, automatic 80%
  - We think we know how to work in such a scenario

Going Semi-Automatic
- Computers are really good at anomaly detection
- We identify and automatically correct the records we are most confident about, and put all doubtful cases up for manual inspection

Language Identification
Problems and Solutions
Work done by Nilu Prahallad

Scale of the Problem
- 1.46 million books in the digital library
- 0.4 million books were tagged with the wrong language or no language at all

Problems in Language
- Blank language field
- Wrong language assigned
- Non-standard conventions
- Multilanguage confusion

Blank Language Field
- This book is a French book; the data entry operator may not have known the language, and so tagged it as unknown

Wrong language assignment
- Data entry errors (copy/paste errors)
  - A batch of books is given a random language
- Lack of language knowledge
  - Not all data entry operators know, identify or speak all the languages that we intend to digitize

Wrong language assignment
- The above is a Chinese book which talks about Japanese ethics
- "Japanese" appears in the title, which led the operator to tag it as a Japanese book instead of a Chinese one

Non Standard Conventions
- Different data entry conventions
  - Ex: English, ENGLISH, en, eng
- Typographic errors by the data entry operators
  - Ex: ENGLIS, ENGL etc.

Multilanguage confusion
- This book is a Chinese book which talks about the techniques of reading and its approaches
- The language field is wrongly tagged as English; it should be Chinese

Impact on ULIB
- Due to the errors described in the preceding slides, the goal of the digital library is hindered
- Accurate and complete access to online books is not possible even though the books are available on the servers

Solutions
Automatic detection of the language
Method: the language is detected automatically using language models.
The steps involved in building the models are:
  1. Obtain the unique letter trigrams in each document
  2. Compute TF-IDF weights for each term
To identify the language of a given title, the steps are:
  1. Obtain the terms from the query title
  2. Compute the cosine correlation between the query title and all the documents
  3. Find the document which produces the maximum correlation with the query title
  4. The language of the query title is the same as the language of the document producing the maximum correlation

Solutions
Advantages:
- Our program can detect the language a book actually belongs to even when multiple languages are mentioned in the title
- Even when the language is tagged as unknown, we can find the language of the book programmatically
- We can correct errors in the language field using the language model and MMR (maximal marginal relevance), by taking the correlation factor for the title and the corresponding language and finding the least likely occurrences in that language
Disadvantages:
- This procedure is not 100% accurate, but gives the desired results in most cases
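
The model-building and identification steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' program: it assumes one reference document per language, and the function names, the smoothed IDF formula and the two toy reference documents are all hypothetical.

```python
import math
from collections import Counter

def trigrams(text):
    """Return the letter trigrams of a text (lower-cased, whitespace-normalized)."""
    s = " ".join(text.lower().split())
    return [s[i:i + 3] for i in range(len(s) - 2)]

def build_models(docs):
    """docs: list of (language, text). Returns (idf, vectors), one TF-IDF vector per document."""
    counts = [Counter(trigrams(text)) for _, text in docs]
    n = len(docs)
    df = Counter(t for c in counts for t in c)  # document frequency of each trigram
    # Assumption: smoothed IDF, so trigrams shared by all documents keep a small weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [(lang, {t: tf * idf[t] for t, tf in c.items()})
               for (lang, _), c in zip(docs, counts)]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def identify_language(title, idf, vectors):
    """Return (language, score) of the reference document most similar to the title."""
    q = {t: tf * idf[t] for t, tf in Counter(trigrams(title)).items() if t in idf}
    return max(((lang, cosine(q, vec)) for lang, vec in vectors), key=lambda p: p[1])

# Toy reference documents (hypothetical); a real system would use far larger samples.
DOCS = [
    ("English", "the history of the kingdom and the lives of the kings and queens of england"),
    ("French", "histoire de la ville de paris et des rois de france depuis les origines"),
]
IDF, VECTORS = build_models(DOCS)
```

Calling identify_language("Histoire de la France", IDF, VECTORS) returns the best-matching language together with its cosine score; a low score could be used to route the record to manual inspection rather than correcting it automatically.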
Subject Categorization
Problems and Solutions
Ting Zhang

General Information
- Total Chinese and English books: 1,027,840
- Total number of combinational subjects: 210,439

Need for Subject Categories
- Subject navigation
- Narrowing down the range of a search

Problems with Subject
- Wrong categorization
- Blank subject field
- Non-English subject field
- Mixed-language subject field
- Very detailed subject field

Wrong categorization
- A History book got classified into Geography

Blank Subject
- Almost 300K books have “NULL” subject information

Non-English subject
- An English-language book tagged with a Chinese subject
- A Chinese-language book tagged with a Chinese subject might be OK, but would create issues for multilingual search and access
- Mixed-language subject

Non-English subject
- Chinese book with Chinese subject

Mixed Language Subjects
- The subject of this book is described in a mixture of English and Chinese

Very detailed subjects
- Almost every book is tagged with a distinct variation of the subject

What needs to be done?
- Standardize the set of subjects, such as art, biology, medicine, physics etc.
- We have defined 29 such standard subjects and made sure that every sub-subject is mapped to one main subject
- As a result, most of the books fit into the 29 subjects
- All 29 catalogues are based on the CLC (Chinese Library Classification); see Appendix 1

Solution: Semi-Automatic
- A librarian manually categorizes one book into a particular category
- A programmer writes a program to identify all titles in the ULIB collection whose title words overlap with it, and attaches the subject tag
- Continue the process for at least 20% of the books, and the remaining 80% get corrected automatically

Our Progress with the solution:
- More than 600K Chinese books got a main subject category
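
The semi-automatic procedure described above (a librarian tags some books, a program spreads the tag to titles with overlapping words) might look something like the sketch below. The slides do not say how overlap is measured, so the Jaccard score, the 0.5 threshold, the whitespace tokenization and all names here are assumptions made for illustration; Chinese titles would additionally need word segmentation.

```python
def title_words(title):
    """Lower-cased set of words in a title, ignoring words of one or two letters."""
    return {w for w in title.lower().split() if len(w) > 2}

def propagate_subjects(labeled, unlabeled, min_overlap=0.5):
    """Spread librarian-assigned subjects to titles with overlapping words.

    labeled: list of (title, subject) pairs tagged manually.
    unlabeled: list of titles without a trusted subject.
    Returns (auto, manual): auto maps title -> subject for confident matches,
    manual lists the titles left for a librarian to review.
    """
    seeds = [(title_words(t), s) for t, s in labeled]
    auto, manual = {}, []
    for title in unlabeled:
        words = title_words(title)
        best_subject, best_score = None, 0.0
        for seed_words, subject in seeds:
            union = words | seed_words
            # Assumption: Jaccard overlap between the two word sets.
            score = len(words & seed_words) / len(union) if union else 0.0
            if score > best_score:
                best_subject, best_score = subject, score
        if best_score >= min_overlap:
            auto[title] = best_subject
        else:
            manual.append(title)
    return auto, manual
```

Titles that end up in the manual list are the natural candidates for the next round of librarian tagging, which in turn gives the program more seed titles to match against.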
[Chart: amount of subjects and amount of books, by subjects' frequency (1, 2 - 10, 10 - 100, 100 - 300, >=300)]

TITLE Correction
Problems and Solutions
Zhenkun Zhou

‘Title’ Statistics
- There are more than 1,466,000 books
- More than 1 million titles are not in English, but in 20 other languages

Issues with TITLE field
- Illegal characters
- Incomplete and incorrect titles
- Varying character sets
- Spelling variations (old / new variations)
- Segmentation and tokenization
- Non-native language titles

Illegal characters
- Mostly punctuation marks
- Example: " Watch Out for the Foreign Guests! "

Incomplete Titles
- Incomplete or partial titles
- Example: there are about 37 books with the same title “Annual report”
  - In fact, their titles should be something like “Hong Kong Immigration Department Annual Report of the Year 2000-2001”

Varying character sets
- Titles in different character sets: GBK, UTF8, ASCII

Varying spelling style
- Example: 明實錄:明太宗實錄 (traditional Chinese) and 明实录:明太宗实录 (simplified Chinese)
- The same is true of Arabic, old and new

Segmentation and Tokenization
- Not a problem, but an issue
- Most languages have word-level segmentation (“ ”, the space), which helps text processing
- For Chinese, segmentation is not easy to deal with, which prevents word-level search on titles

Non-native language titles
- Standard transliteration notation for enabling cross-language search
  - Ex. “齐白石” → “Qi Bai Shi”
- Displaying the transliteration and the equivalent translation of a book's title would let us know what the book is about

Solutions
For titles with punctuation mistakes or incomplete titles
- Use parsing tools to correct them, e.g. Perl
- Advantage: regular expressions can handle different situations
- Disadvantage: can’t predict all situations, and is sometimes imprecise

Solutions
For titles in different character sets
- Convert the book titles into a UTF character set, e.g. UTF8

Solutions
For titles in different spelling styles
- Change the different titles of the same book into one style
  - Ex. “中国”, “中國” → “中国”
  - Advantage: offline, easy
  - Disadvantage: poor extensibility, not correct in concept
- Transform titles between styles
  - Ex. “中国” → “中國”, “中國” → “中国”
  - Advantage: online, good extensibility
  - Disadvantage: needs processing time

Solutions
Title translation and transliteration
- Translate titles from different languages
  - Ex. “中国历史” → “Chinese History”
- Automatic translation (Zhejiang Univ) and transliteration module (open source tool)

Future Research Directions
Vamshi Ambati

Subject Categorization
- Text categorization
  - Requires a large amount of text
  - At ULIB, not all languages have an OCR
- Can we do well with sparse data?
  - Semantics of words using WordNet
- Can we use contextual information?
  - Ex: Jane Austen, Charles Dickens - Literature
  - Ex: Swami Prabudha - Religion

Language Identification
- Our ‘byte frequency’ based language identification approach has a lot of problems when the languages are close
  - Hindi, Sanskrit
- Can we use larger context?
  - Longer character sequences
  - Function words: ’of’, ’the’ (English)
  - Dictionaries
- Language identification from images

Agents that learn by Example
- OCLC has arguably the most accurate data we have so far
  - Can we programmatically access it, compare it with our existing data and correct ours?
- Some of the information regarding books is available in multiple catalogues all over the web (including Wikipedia)
  - Can we benefit from this?

Language Translation
- Good-enough translation for titles and subjects
- The Universal Dictionary of All Languages (Dr. Shamos) could be a starting point
- Google Translation Systems could help
- A system at Xiamen University in China has already helped us do the
  translation
- We at CMU and IISc will address most of the other languages

Thank you
Suggestions/Questions?

Appendix
Subjects list

Appendix 1
Catalogue: Content
Agriculture: agricultural engineering, agronomy, gardening, forestry, herding, veterinary, hunting, silkworm, bee, aquatic product, fishery etc.
Architecture: art of building, architectural science (including: architectural exploration, architectural design, architectural structure, soil mechanics, building’s foundations, building materials, construction technology, building equipment, regional planning, town planning, public works)
Art: painting, calligraphy, seal cutting, photographic art, industrial art, music, dance, drama, cinematic, television art
Astronomy: astronomy
Biography: biography
Biology: general biology, cytology, genetics, biochemistry, biophysics, molecular biology, bioengineering, environmental biology, paleontology, microbiology, botany, zoology, entomology, anthropology
Chemistry: inorganic chemistry, organic chemistry, macromolecular chemistry / polymer chemistry, physical chemistry, theoretical chemistry, analytical chemistry, applied chemistry
Computer Science: automation, computing technique
Economics: political economics, economic profile, economic history, economic geography, economic planning, economic management, agricultural economy, industry economy, traffic and transport economy, trade, marketing
Education: education, education at all levels, all forms of education, information & knowledge dissemination, cultural activities
Engineering: general industrial technology, mineral engineering, petroleum and natural gas industry, metallurgical industry, metallographic & smith craft, machinery & meter craft, weapon industry, energy industry, atomic energy technology, electrical engineering, radio electronics & telegraphy, chemical industry, light industry & handicraft, hydraulic engineering, transportation, aviation & space flight
Environmental Science: environmental science
Geography: human geography, natural geography, geophysics, topography, meteorology, geology, oceanography
History: archaeology, folkways
Language: linguistics, minority language, foreign language, all kinds of language systems
Literature: literary theory, Chinese literature etc.
Mathematics: mathematics
Medicine: basic medicine, clinical medicine, preventive medicine, hygiene, pharmacy etc.
Military: strategy, tactics, military campaign, military technology, military geography etc.
Natural Science: system theory, methodology etc.
Philosophy: logic, ethics, aesthetics etc.
Physics: dynamics, physics etc.
Poetry: poetry
Politics & Law: diplomacy, political relations, law etc.
Psychology: psychology
Religion: religion, divination, superstition etc.
Social Science: management theory, statistics, sociology, demography, science of personnel etc.
General: encyclopedia, dictionary, book catalogue & abstract & indexing etc.
Miscellaneous: not included in the above catalogues

For one million books..
- Out of 1 million books, Chinese books and English books are the ones mainly tagged wrong