Automatic Indexing with the EuroVoc Thesaurus Enabling Cross-lingual Search

Marie-Francine Moens, Katholieke Universiteit Leuven, Belgium
Frane Šarić, University of Zagreb, FER, Croatia

18-19 November 2010, Luxembourg – Kirchberg

CADIAL project
- Computer Aided Document Indexing for Accessing Legislation
- A joint Flemish-Croatian project
- Partners:
  - Katholieke Universiteit Leuven (prof. Marie-Francine Moens)
  - University of Zagreb & Hidra (prof. Bojana Dalbelo Bašić, prof. Marko Tadić)
- Goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

CADIAL project (cont.)
1. Manually index 10,000 documents
   - eCADIS – semi-automatic document indexing
2. Use that data to train automatic indexers
   - Trained automatic classifiers for every EuroVoc descriptor
3. Provide indexed data to custom search engine
   - CADIAL search engine

eCADIS
- Computer Aided Document Indexing System
- Provides useful information that helps indexers index documents more quickly
  - Counts n-grams
  - Includes word normalization
  - Extracts collocations
- Suggests appropriate descriptors
  - Uses automatically trained classifiers

eCADIS (cont.)

Morphological normalization
- Croatian = morphologically complex language
- Inflectional variation
- Derivational variation

Morphological normalization (cont.)
- Lexicon-based normalization [Snajder et al. 2009]:
  - Inflectional and derivational rules
  - String transformation functions
- Higher-order functional representation of Croatian inflectional morphology:
  - Inflectional rules
  - Transformations: higher-order functions

Named entity recognition
- Named entity recognition = semantic classification of entity name (usually proper name)
- [Bekavac & Tadic 2009]: person, location, organization, date, ...
- Use of lists of names and of finite-state automata

Lexical association metrics
- Collocation: the meaning of a compound term cannot be inferred from the meanings of its individual terms
- Collocations are valuable index terms
- Several methods were developed:
  - Extraction of linked terms in Wikipedia, filtered by acceptable part-of-speech patterns [Bekavac & Tadic 2009]
  - Terme-X: use of lexical association measures (e.g., chi-square, log-likelihood ratio for a binomial distribution, pointwise mutual information) to build a dictionary of collocations filtered by acceptable part-of-speech patterns [Delac et al. 2009]
  - Use of a genetic programming algorithm for learning a language-adapted lexical association measure [Snajder et al. 2009]

Text categorization
- = Assignment of terms of the EuroVoc thesaurus
- Currently done at the statute level
- Problem: large number of features (terms) and often few training examples => feature selection: chi-square, frequent item sets, linear classifier weights, ... [Boiy & Moens 2009]
- Use of common classification algorithms: support vector machines, logistic regression, ... [Saric et al. in preparation]
- EuroVoc = multilingual => terms can be used in cross-lingual retrieval

Text categorization (cont.)
- Core of the CADIAL project
- System suggests index terms to the human indexers
- High performance of the categorization: e.g., F1 measure in the 80% range
- As the number of categorized documents grows, we hope to learn better classification models
- Possibility to exploit the hierarchical organization of the thesaurus terms to improve the accuracy of the categorization

[Bennett & Nguyen 2009]

TMT: Object-oriented text classification library

Comparing document classification schemes
- Problem: discrepancy between the classification scheme (e.g. EuroVoc thesaurus) and the natural clusters formed by the documents
- How to find this discrepancy so that the classification scheme can be adapted? [Silic et al. 2009]
  - Finding an optimal clustering and comparing it with the clusters formed by the ground-truth categories of the documents
  - Dimensionality reduction with principal component analysis (PCA): visualization of the clusters

CADIAL Search Engine
- http://cadial.hidra.hr
- Full-text search over a collection of 20,000 legal documents
- Documents are automatically indexed using EuroVoc descriptors
- Hidra ensures that additional metadata is correct:
  - Regulation status (valid / invalid)
  - Area of activity
  - EU accession chapter

The CADIAL search engine
- Possibility to search:
  - Full text
  - Titles
  - EuroVoc thesaurus terms
  - Historical versions
  - ...
- Legislation = semi-structured documents: possibility to take the structure into account when computing the relevance of an article, section, etc.
- Successful participation at the INEX competition 2008 [Mijic et al. 2009]

CADIAL Search – live demo

CADIAL Search – document metadata

Towards cross-lingual search

Cross-lingual indexing
- Classification/indexing of documents = supervised machine learning of the classification patterns based on annotated training examples
- When multilingual documents are not linked:
  - Demands manual annotation in different languages
  - Can require considerable manual effort:
    - Changing collections, taxonomies
    - Many official languages in the EU
- Transfer learning can be a solution

Cross-lingual indexing (cont.)
- Potential of transfer learning techniques [Pan & Yang IEEE TKDE 2010]
- Co-training and co-regularization techniques for learning classification patterns from documents in multiple languages [Amini et al. SIGIR 2010]

Conclusions
- CADIAL = valuable example of automatic indexing enabling cross-lingual search
- EuroVoc thesaurus is a valuable resource
- Many different future tracks of research aiming at more flexible and accurate indexing
- http://www.cadial.org/

The CADIAL project has received the 2009 Prime Minister Award for special achievements in the field of e-Government in Croatia and the 2009 "Golden Tesla's Egg" Award of the VIDI publishing house for the best innovative solution in ICT for the category Academic Institutions. The project was invited to participate at CeBIT 2010, the world's foremost tradeshow for the digital industry.
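
Backup: lexical association measure, illustration
The lexical association metrics slide lists pointwise mutual information (PMI) among the measures used to score collocation candidates. The sketch below shows how such a score can be computed for adjacent word pairs. It is an illustration only and not the Terme-X implementation: the function name `pmi_scores`, the `min_count` threshold, and the toy English sentence are assumptions made for this example, and the part-of-speech filtering mentioned on the slide is left out.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw counts. Pairs seen fewer than `min_count` times
    are skipped, since PMI is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy input (hypothetical; real input would be normalized Croatian text).
tokens = (
    "official gazette publishes the law and "
    "the official gazette publishes the act"
).split()

scores = pmi_scores(tokens)

# Print candidate pairs from strongest to weakest association.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy input, "official gazette" scores higher than "publishes the", because "the" is frequent on its own and therefore less strongly tied to any one neighbour. A real collocation dictionary would combine such a score with the part-of-speech filter described on the slide.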