Chapter 10
Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
Multilingual Environments
What is Cross-Language Information Retrieval?
Interdisciplinary relationship in CLIR
Major Problems in CLIR
Major Approaches in CLIR
Summary

Multilingual Collections
There are 6,703 languages listed in the Ethnologue.
Digital libraries
  – OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
World Wide Web
  – Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

Real-World Language Speakers (http://www.g11n.com/faq.htm)
[Figure: bar chart of speakers (millions, axis 0 to 800) for Chinese, English, Hindi/Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]

[Figure: language shares for Dutch, Portuguese, Italian, Korean, Spanish, Swedish, Chinese, French, German, Japanese (Statistics from Euro-Marketing Associates, 1998)]

In 1998 the share of Chinese speakers (6.1%) was smaller than the share of French speakers (8.8%).
(Statistics from Euro-Marketing Associates, 1999) http://www.glreach.com/globstats/

Language Speakers on the Internet
[Figure]

Internet Content
[Figure: Internet hosts (thousands, logarithmic axis) by language, estimated by domain (Network Wizards Jan 99 Internet Domain Survey). Bar values as extracted: 33,878; 1,687; 1,684; 654; 546; 546; 473; 458; 432. Languages as extracted: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]
40% of Internet users do not understand English, but 80% of Internet content is in English.

[Figure] (Source: http://www.emarketer.com)

What is Cross-Language Information Retrieval?
Definition: Select information in one language based on queries in another.
Terminologies
  – Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
  – Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Generalization: Multi- & Cross-Lingual Information Access
[Figure]

MLIR Applications
Multilingual information access in a multilingual country, organization, enterprise, etc.
Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries in it (small active vocabulary).
Monolingual users may retrieve images by taking advantage of multilingual captions.
Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Why is Cross-Language Information Retrieval Important?
More information workers with less time require fast access to global resources
Global B2B interactions (virtual enterprises)
Global B2C interactions (online trading, travelling)
Time-critical information (translation comes too late)

History
1970 Salton runs retrieval experiments with a small English/German dictionary
1972 Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation
1978 ISO Standard 5964 for developing multilingual thesauri (revised in 1985)
1990 Latent Semantic Indexing (LSI) applied to CLIR

History (Continued)
1994 1st PhD thesis on CLIR by Khaled Radwan
1996 Similarity thesaurus applied to CLIR (ETH Zurich)
1996 Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble)
1997 Generalized Vector Space Model (GVSM) applied to CLIR (CMU)

History (Continued)
1997 CLIR (Cross-Language Information Retrieval) track starts within TREC
1998 NTCIR starts in Japan
1999 TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.
2000 CLEF starts in Europe

An Architecture of Multilingual Information Access
[Figure: components are Multiple Languages; Multilingual Resources; Language Identification (LI); Information Extraction; Information Filtering; Information Retrieval; Query Translation; Text Classification; Document Translation; Text Summarization; Text Processing; Language Translation; User Interface (UI); Native Language(s)]

An Architecture of Cross-Language Information Retrieval
[Figure]

Building Blocks for CLIR
Information Retrieval
Information Science
Artificial Intelligence
Speech Recognition
Computational Linguistics

Information Science
User interface
Interactive search technique
Thesaurus construction
Evaluation

Computational Linguistics
Language identification
Morphological analysis
Stylistic analysis
Part-of-speech tagging
Identifying occurrences of phrases
Using parallel corpora
Using comparable corpora

Computational Linguistics (Continued)
Aligning documents
Identifying occurrences of geographic and temporal concepts
Stochastic language models
Word disambiguation
Lexicons (morphology, part-of-speech)
Bilingual dictionaries (terms and possible translations)

Information Retrieval (w/o CL)
Filtering
Relevance feedback
Document representation
Latent semantic indexing
Generalized vector space model
Collection fusion
Passage retrieval

Information Retrieval (Continued)
Similarity thesaurus
Local context analysis
Automatic query expansion
Fuzzy term matching
Adapting retrieval methods to the collection
Building cheap test collections
Evaluation

Artificial Intelligence
Machine translation
Machine learning
Template extraction and matching
Building large knowledge bases
Semantic network

Speech Recognition
Signal processing
Pattern matching
Phone lattice
Background noise elimination
Speech segmentation
Modeling speech prosody
Building test databases
Evaluation

Building Blocks Dealing with Term Dependencies
IS: ISO thesaurus
CL: Word disambiguation, bilingual dictionaries
AI: Semantic network
SR: Stochastic language models
IR: LSI, GVSM, similarity thesaurus, local context analysis, (weighted) Boolean filters

Major Problems of CLIR
Queries and documents are in different languages.
  – translation
Words in a query may be ambiguous.
  – disambiguation
Queries are usually short.
  – expansion

Major Problems of CLIR (Continued)
Queries may have to be segmented.
  – segmentation
A document may be written in several languages.
  – language identification

Enhancing Traditional Information Retrieval Systems
Which part(s) should be modified for CLIR?
[Figure: Documents (1) -> Document Representation (2) -> Comparison; Queries (3) -> Query Representation (4) -> Comparison]

Enhancing Traditional Information Retrieval Systems (Continued)
(1): text translation
(2): vector translation
(3): query translation
(4): term vector translation
(1) and (2), (3) and (4): interlingual form

What are the Problems?
Ambiguous terms (e.g., performance)
Multiword phrases may correspond to single words (e.g., South Africa => 南非, Südafrika)
Coverage of the vocabulary
There is no one-to-one mapping between two languages
Translating queries automatically (lack of syntax)
Translating documents automatically (performance, …)
Computing mixed result lists

Cross-Language Information Retrieval
[Figure: taxonomy of approaches]
Query Translation
  – Controlled Vocabulary
  – Free Text
    • Knowledge-based: Ontology-based, Dictionary-based, Thesaurus-based
    • Corpus-based: Term-aligned, Sentence-aligned, Document-aligned (Parallel, Comparable), Unaligned
    • Hybrid
Document Translation
  – Text Translation
  – Vector Translation
No Translation

Query Translation Based CLIR
[Figure: English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents]

Translating the 400 Million non-English Pages of the WWW
... would take 100,000 days (300 years) on one fast PC, or 1 month on 3,600 PCs.

Controlled Vocabulary
Sublanguage chosen by human indexers
National Library of Medicine
  – Unified Medical Language System (UMLS)
  – Integrating medical coverage of many thesauri
    • English, French, German, Portuguese

Knowledge-Based
Examples
  – Subject Thesaurus
    • Hierarchical and associative relations.
    • Unique term assigned to each node.
  – Concept List
    • Term space partitioned into concept spaces.
  – Term List
    • List of cross-language synonyms.
  – Lexicon
    • Machine-readable syntax and/or semantics.

Ontology-Based Approaches
Exploit complex knowledge representations, e.g., EuroWordNet
A Proposal for Conceptual Indexing using EuroWordNet

Ontology-Based Approaches (Continued)
The Indexing Process
[Figure]

Dictionary-Based Approaches
Exploit machine-readable dictionaries.
Problems
  – translation ambiguity + target polysemy
  – coverage (unknown words, abbreviations, ...)
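The two problems named on the Dictionary-Based Approaches slide, ambiguity and coverage, can be illustrated with a minimal sketch. The toy English-to-Spanish dictionary, the function name `translate_query`, and the sample query are all assumptions made for illustration; they are not taken from any system described in this chapter.

```python
# Hypothetical toy dictionary: each English term maps to several Spanish senses.
TOY_DICT = {
    "bank": ["banco", "orilla"],        # financial bank vs. river bank
    "interest": ["interés", "rédito"],
    "rate": ["tasa", "tipo"],
}

def translate_query(query, dictionary):
    """Replace every query term by all of its dictionary translations.

    Returns the translated term list and the terms the dictionary
    does not cover (these are passed through untranslated).
    """
    translated, unknown = [], []
    for term in query.lower().split():
        candidates = dictionary.get(term)
        if candidates is None:
            unknown.append(term)           # coverage problem
            translated.append(term)
        else:
            translated.extend(candidates)  # ambiguity: every sense is kept
    return translated, unknown

terms, unknown = translate_query("bank interest rate LIBOR", TOY_DICT)
print(terms)    # three source terms become six target terms plus one unknown
print(unknown)
```

Keeping every translation makes the query grow with extraneous senses, while the abbreviation falls outside the dictionary entirely.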

Dictionary-Based Approaches (Continued)
Issue 1: selection strategy
  – Select all.
  – Select N randomly.
  – Select best N.
Issue 2: which level
  – word
  – phrase

Selection Strategy: Select All
Hull and Grefenstette 1996
  – Take the concatenation of all term translations.
    E: politically motivated civil disturbances
    F: troubles civils a caractere politique
    trouble - turmoil, discord, trouble, unrest, disturbance, disorder
    civil - civil, civilian, courteous
    caractere - character, nature
    politique - political, diplomatic, politician, policy
  – Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original.
  – Errors: multi-word expressions and ambiguity

Selection Strategy: Select All (Continued)
Davis 1997 (TREC5)
  – Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
  – Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12% of monolingual

Evaluation Method
Average Precision (5-, 9-, 11-points)
Model
[Figure: three runs against the TREC Spanish Corpus]
  – TREC Spanish Query -> Mono IR Engine
  – English Query -> Bilingual Dictionary -> Spanish Equivalents -> Mono IR Engine
  – English Query -> POS Bilingual Dictionary -> Spanish Equivalents by POS -> Mono IR Engine

Selection Strategy: Select N
Simple word-by-word translation
  – Each query term is replaced by the word or group of words given for the first sense of the term’s definition.
    • 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)
Word/phrase translation
  – Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if it appears in the dictionary.
    • 30-50% worse than good translation
  – Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
    • WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)

Selection Strategy: Select Best N
Hayashi, Kikui and Susaki 1997
  – Search for a dictionary entry corresponding to the longest sequence of words from left to right
  – Choose the most frequently used word (or phrase) in a text corpus collected from the WWW
  – No results reported for this query translation approach
Davis 1997 (TREC5)
  – POS disambiguation
  – Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): about 67.3% of monolingual

Corpus-Based Approaches
Categorization
  – Term-Aligned
  – Sentence-Aligned
  – Document-Aligned (Parallel, Comparable)
  – Unaligned
Usage
  – Set up a thesaurus
  – Vector mapping

Term-Aligned Corpora
Fine-grained alignment in parallel corpora
Oard 1996
  – Term alignment is a challenging problem.
[Figure: components are English Query; Parallel Bilingual Corpus; Co-occurrence Statistics; Translation Tables; Machine Translation System; Spanish Query]

Sentence-Aligned Corpora
Davis & Dunning 1996 (TREC4)
  – High-frequency Terms
[Figure]

Sentence-Aligned Corpora (Continued)
  – Statistically Significant Terms
[Figure]

Sentence-Aligned Corpora (Continued)
Precision-Recall Averages
[Figure]

Document-Aligned Corpora
Exploit parallel or comparable corpora
Parallel: linked translation equivalents
  – LSI mate retrieval achieves 99% effectiveness
Comparable: separate authorship, same topic
  – Easier to find, harder to link the documents

Query Term Disambiguation
[Figure]

Comparable Document-Aligned Corpora
Sheridan & Ballerini 1996
  – Create a comparable corpus. Align news stories in German and Italian by topic label and date, and merge them to create pseudo-parallel documents.
  – Generate a co-occurrence thesaurus.
  – Perform translations using the thesaurus.
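The three steps on the Comparable Document-Aligned Corpora slide can be sketched as follows. This is a simplified illustration, not Sheridan and Ballerini's implementation: the thesaurus here uses raw co-occurrence counts instead of their similarity weighting, and the function names, story tuples, and sample words are all hypothetical.

```python
from collections import Counter, defaultdict

def align(german, italian):
    """Pair stories that share the same (topic label, date) key."""
    by_key = defaultdict(list)
    for topic, date, tokens in italian:
        by_key[(topic, date)].append(tokens)
    pairs = []
    for topic, date, de_tokens in german:
        for it_tokens in by_key.get((topic, date), []):
            pairs.append((de_tokens, it_tokens))  # one pseudo-parallel document
    return pairs

def build_thesaurus(pairs):
    """Count how often each German term co-occurs with each Italian term."""
    cooc = defaultdict(Counter)
    for de_tokens, it_tokens in pairs:
        for de in set(de_tokens):
            cooc[de].update(set(it_tokens))
    return cooc

def translate(term, thesaurus, n=1):
    """Return the n Italian terms that co-occur most often with the term."""
    return [t for t, _ in thesaurus[term].most_common(n)]

# Hypothetical stories: (topic label, date, tokens)
german = [("politics", "1996-01-05", ["wahl", "regierung"]),
          ("sport", "1996-01-05", ["fussball", "tor"]),
          ("politics", "1996-01-06", ["wahl", "partei"])]
italian = [("politics", "1996-01-05", ["elezione", "governo"]),
           ("sport", "1996-01-05", ["calcio", "gol"]),
           ("politics", "1996-01-06", ["elezione", "partito"])]

pairs = align(german, italian)
thesaurus = build_thesaurus(pairs)
print(translate("wahl", thesaurus))
```

With so little data many candidates tie, which hints at why the quality of the alignment matters for this family of methods.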

Unaligned Corpora
No document links
Used in conjunction with dictionaries
  – Pretranslation local feedback (Ballesteros & Croft 1997)

Brief Summary
Dictionary-based methods
  – Specialized vocabulary not in the dictionaries will not be translated.
  – Ambiguities will add extraneous terms to the query.
Parallel/comparable corpora-based methods
  – Parallel corpora are not always available.
  – Available corpora tend to be relatively small or to cover only a small number of subjects.
  – Performance depends on how well the corpora are aligned.

Brief Summary (Continued)
Dictionaries are very useful.
  – Achieve about 50% of monolingual performance on their own
Parallel corpora have limitations.
  – Domain shifts
  – Term alignment accuracy
Dictionaries and corpora are complementary.
  – Dictionaries provide broad and shallow coverage.
  – Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
What knowledge can be employed?
  – lexical knowledge
  – corpus knowledge
  – ...

Hybrid Methods (Continued)
Query Expansion
  – Issue 1: context
    • pseudo relevance feedback (local feedback): A query is modified by the addition of terms found in the top retrieved documents.
    • local context analysis: Queries are expanded by the addition of the top ranked concepts from the top passages.
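Pseudo relevance feedback as described above can be sketched in a few lines. The retrieval score used here (a count of query-term occurrences) is a stand-in chosen for brevity, and the names `score` and `expand_query` and the sample documents are hypothetical; a real system would rank with its own retrieval model.

```python
from collections import Counter

def score(query_terms, doc_tokens):
    """Toy retrieval score: how often the query terms occur in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

def expand_query(query_terms, docs, top_docs=2, new_terms=1):
    """Add the most frequent non-query terms of the top retrieved documents."""
    ranked = sorted(docs, key=lambda d: score(query_terms, d), reverse=True)
    pool = Counter()
    for doc in ranked[:top_docs]:          # assume the top documents are relevant
        pool.update(t for t in doc if t not in query_terms)
    added = [t for t, _ in pool.most_common(new_terms)]
    return list(query_terms) + added

docs = [
    ["election", "vote", "ballot", "vote", "ballot"],
    ["election", "ballot", "candidate"],
    ["football", "goal", "match"],
    ["parliament", "debate"],
]
expanded = expand_query(["election", "vote"], docs)
print(expanded)
```

The same routine can be run on the source-language query before translation or on the translated query afterwards, depending on which collection `docs` comes from.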

Hybrid Methods (Continued)
  – Issue 2: when
    • before query translation
    • after query translation

Pseudo-Relevance Feedback Illustrated
[Figure]

Query Expansion through Local Context Analysis
Local analysis
  – Based on the set of documents retrieved for the original query
  – Based on term co-occurrence inside documents
  – Terms closest to individual query terms are selected
Global analysis
  – Based on the whole document collection
  – Based on term co-occurrence inside small contexts and phrase structures
  – Terms closest to the whole query are selected

Query Expansion through Local Context Analysis (Continued)
Candidates
  – Noun groups instead of simple keywords
  – Single noun, two adjacent nouns, or three adjacent nouns
Query expansion
  – Concepts are selected from the top ranked documents (as in local analysis)
  – Passages are used for determining co-occurrence (as in global analysis)

Query Expansion through Local Context Analysis (Continued)
Algorithm
  – Retrieve the top n ranked passages using the original query
  – For each concept in the top ranked passages, the similarity sim(q,c) between the whole query q and the concept c is computed using a variant of tf-idf ranking
  – The top m ranked concepts are added to the original query q
    • Each concept is assigned a weight 1 - 0.9 × i/m (i: rank of the concept)
    • Each term in the original query is assigned a weight 2 × original weight

Hybrid Methods (Continued)
Ballesteros & Croft 1997
[Figure: experimental setup. Original Spanish TREC queries are turned by human translation into English (BASE) queries. The English queries are turned back into Spanish queries by automatic dictionary translation, with query expansion applied to the English queries before translation and/or to the Spanish queries after translation. The Spanish queries are run on INQUERY.]

Hybrid Methods (Continued)
  – Performance Evaluation
    • pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
    • post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
    • combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
    • 32% below a monolingual baseline

Hybrid Methods (Continued)
Davis 1997 (TREC5)
[Figure: components are English Query; Bilingual Dictionary (POS); Spanish Equivalents; UN English and UN Spanish parallel corpus; IR Engine; Compare Document Vectors; Reduced Equivalent Set; Mono IR Engine; TREC Spanish Corpus]

Hybrid Methods (Continued)
  – Corpus-based disambiguation vs. POS-based disambiguation
  – Average precision, with percentage of the monolingual run:
    MONO  0.2895
    ALL   0.1422  49.12%
    CORP  0.1153  39.83%
    POS   0.1949  67.32%
    BOTH  0.2127  73.47%

Document Translation
Translate the documents, not the query
[Figure: Documents -> MT -> Document Representation -> Comparison; Queries -> Query Representation -> Comparison]
(1) Efficiency problem
(2) Retrieval effectiveness? (word order, stop words)
(3) Cross-language mate finding using MT-LSI (Dumais et al., 1997)

Vector Translation
Translate document vectors
[Figure: components are Documents; Document Representation; MT; Queries; Query Representation; Comparison]

No Translation
Latent Semantic Indexing (Dumais et al., 1997)
[Figure]

No Translation (Continued)
Cross-Language Retrieval Using LSI
Resource: document-aligned parallel corpus

No Translation (Continued)
Yellow Page Cross-Language Retrieval
          Top 1   Top 10
  CL-LSI  63.8%   86.9%
  MT      57.5%   74.8%

A Comparative Evaluation
Carbonell, Yang, Frederking, et al. (CMU, LTI)
  – Corpus-driven Term Translation (TMT)
  – Pseudo-Relevance Feedback (PRF)
  – Generalized Vector Space Model (GVSM)
  – Latent Semantic Indexing (LSI)
  – GVSM slightly outperforms LSI, which in turn outperforms PRF and TMT.

Research Directions
Comparable corpus techniques
  – Automatic document linking
Dictionary-based approaches
  – Word sense disambiguation
Evaluation
  – Side-by-side tests
  – Controllable domain shift

CLIR System Using Query Translation
[Figure]

Generating Mixed Ranked Lists of Documents
Normalizing scales of relevance
  – using aligned documents
  – using ranks
  – interleaving according to given ratios
Mapping documents into the same space
  – LSI
  – document translations

Tools

Types of Tools
Mark-Up Tools
Language Identification
Stemming/Normalization
Entity Recognition
Part-of-Speech Taggers
Indexing Tools
Text Alignment
Speech Recognition/OCR
Visualization
Character Set/Font Handling
Word Segmentation
Phrase/Compound Handling
Terminology Extraction
Parsers/Linguistic Processors
Lexicon Acquisition
MT Systems
Summarization

Character Set/Font Handling
Input and Display Support
  – Special input modules, e.g. for Asian languages
  – Out-of-the-box support much improved thanks to modern web browsers
Character Set/File Format
  – Unicode/UTF-8
  – XML

Language Identification
Different levels of multilingual data
  – In different sub-collections
  – Within sub-collections
  – Within items
Different approaches
  – Tri-gram
  – Stop words
  – Linguistic analysis

Stemming/Normalization
Reduction of words to their root form
Important for languages with rich morphology
Rule-based or dictionary-based
Case normalization
Handling of diacritics (French, …)
Vowel (re-)substitution (e.g. Semitic languages, …)

Entity Recognition/Terminology Extraction
Proper Names, Locations, ...
  – Critical, since often missing from dictionaries
  – Special problems in languages such as Chinese
Domain-specific vocabulary, technical terms
  – Critical for effectiveness and accuracy

Phrase/Compound Handling
Collocations (“Hong Kong”)
  – Important for dictionary lookup
  – Improves retrieval accuracy
Compounds (“Bankangestelltenlohn”: bank employee salary)
  – Big problem in German
  – Infinite number of compounds, so a dictionary is not a viable solution

Lexicon Acquisition/Text Alignment
Goal: automatic construction of data structures such as dictionaries and thesauri
  – Work on parallel and comparable corpora
  – Terminology extraction
  – Similarity thesauri
Prerequisite: training data, usually aligned
  – Document, sentence, word level alignment

CLIR Evaluation at TREC

Too Many Factors in CLIR System Evaluation
Translation
Automatic relevance feedback
Term expansion
Disambiguation
Result merging
Test collection
The number of factors needs to be reduced in order to see what happened.

TREC-6 Cross-Language Track
In cooperation with the Swiss Federal Institute of Technology (ETH)
Task summary: retrieval of English, French, and German documents, both in a monolingual and a cross-lingual mode
Documents
  – SDA (1988-1990): French (250 MB), German (330 MB)
  – Neue Zürcher Zeitung (1994): German (200 MB)
  – AP (1988-1990): English (759 MB)
13 participating groups

TREC-7 Cross-Language Track
Task summary: retrieval of English, French, German and Italian documents
Results to be returned as a single multilingual ranked list
Addition of Italian SDA (1989-1990), 90 MB
Addition of a subtask of 31,000 structured German social science documents (GIRT)
9 participating groups

TREC-8 Cross-Language Track
Tasks, documents and topic creation similar to TREC-7
12 participating groups

CLIR in TREC-9
Documents
  – Hong Kong Commercial Daily, Hong Kong Daily News, Takungpao: all from 1999 and about 260 MB in total
25 new topics built in English; translations made to Chinese

Cross-Language Evaluation Forum
A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
Extension of the CLIR track at TREC (1997-1999)

Main Goals
Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  – CLIR system evaluation, testing and tuning
  – Comparison and discussion of results

CLEF 2000 Task Description
Four evaluation tracks in CLEF 2000
  – multilingual information retrieval
  – bilingual information retrieval
  – monolingual (non-English) information retrieval
  – domain-specific IR

CLEF 2000 Document Collection
Multilingual Comparable Corpus
  – English: Los Angeles Times
  – French: Le Monde
  – German: Frankfurter Rundschau + Der Spiegel
  – Italian: La Stampa
Similar in genre, content, time

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
Multi-media
  – Selecting suitable media to represent contents
Multi-linguality
  – Decreasing the language barriers
Multi-culture
  – Integrating multiple cultures

NPDM Project
The Palace Museum, Taipei, is one of the famous museums in the world
NSC supports a pioneering study of a digital museum project, NPDM, starting in 2000
  – Enamels from the Ming and Ch’ing Dynasties
  – Famous Album Leaves of the Sung Dynasty
  – Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
Standardization
  – A standard metadata protocol is indispensable for the interchange of resources with other museums.
Multimedia
  – A suitable presentation scheme is required.
Internationalization
  – to share the valuable resources of NPDM with users of different languages
  – to utilize knowledge presented in a foreign language

Translingual Issue
CLIR
  – to allow users to issue queries in one language to access documents in another language
  – the query language is English and the document language is Chinese
Two common approaches
  – Query translation
  – Document translation

Resources in NPDM Pilot
An enamel, a calligraphy, a painting, or an illustration
MICI-DC
  – Metadata Interchange for Chinese Information
  – Fields accessible to users
    • Short descriptions vs. full texts
    • Bilingual versions vs. Chinese only
  – Fields for maintenance only

Search Modes
Free search
  – users describe their information need in natural language (Chinese or English)
Specific topic search
  – users fill in specific fields denoting authors, titles, dates, and so on

Example
Information need
  – Retrieve “Travelers Among Mountains and Streams, Fan K’uan” (“范寬谿山行旅圖”)
Possible queries
  – Author: Fan Kuan; Kuan, Fan
  – Time: Sung Dynasty
  – Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  – Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

[Figure: English-Chinese IR (ECIR) in NPDM. Components: English Query; Query Translation; Document Translation; Name Search (English Names, Specific Bilingual Dictionary, Machine Transliteration, Chinese Names); Title Search (English Titles, Query Disambiguation, Generic Bilingual Dictionary, Chinese Titles); Chinese Query; Chinese IR System; NPDM Collection]

Specific Topic Search
Proper names are important query terms
  – Creators such as “林逋” (Lin P’u), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
  – Emperors such as “康熙” (K’ang-hsi), “乾隆” (Ch’ien-lung), “徽宗” (Hui-tsung), etc.
  – Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ch’ing), etc.

Name Transliteration
The writing systems of Chinese and English are totally different
Wade-Giles (WG) and Pinyin are two well-known systems for romanizing Chinese in libraries
Backward transliteration
  – Transliterate target language terms back to source language ones
  – Chen, Huang, and Tsai (COLING, 1998)
  – Lin and Chen (ROCLING, 2000)

Name Mapping Table
Divide a name into a sequence of Chinese characters, and transform each character into phonemes
Look up the phoneme-to-WG (or Pinyin) mapping table, and derive a canonical form for the name
Example
  – “林逋” => “ㄌㄧㄣ ㄆㄨ” => “Lin P’u” (WG)

Name Similarity
Extract the named entity from the query
Select the most similar named entity from the name mapping table
Naming sequence/scheme
  – LastName FirstName1, e.g., Chu Hsi (朱熹)
  – FirstName1 LastName, e.g., Hsi Chu (朱熹)
  – LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  – FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  – Any order, e.g., Tao Ning Hsu (許道寧)
  – Any transliteration, e.g., Ju Shi (朱熹)

Title
“谿山行旅圖”: “Travelers among Mountains and Streams”
“travelers”, “mountains”, and “streams” are the basic components
Users can express their information need by describing the desired work of art
The system measures the similarity between art titles (descriptions) and a query

Free Search
A query is composed of several concepts.
Concepts are either transliterated or translated.
Query translation is similar to a small-scale IR system
Resources
  – Name-mapping table
  – Title-mapping table
  – Specific English-Chinese Dictionary
  – Generic English-Chinese Dictionary
  – …

Algorithm
(1) For each resource, the Chinese translations whose scores are larger than a specific threshold are selected.
(2) The Chinese translations identified from different resources are merged and sorted by their scores.
(3) Consider the Chinese translation with the highest score in the sorting sequence.
  – If the intersection of the corresponding English description and query is not empty, then select the translation and delete the common English terms between query and English description from query.
  – Otherwise, skip the Chinese translation.

Algorithm (Continued)
(4) Repeat step (3) until query is empty or all the Chinese translations in the sorting sequence are considered.
(5) If the query is not empty, then these words are looked up from the general dictionary.
A Chinese query is composed of all the translated results.
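Steps (1) to (5) above can be sketched as a greedy selection loop. This is a minimal illustration under stated assumptions: each resource is taken to be a list of precomputed (Chinese translation, English description terms, score) candidates, because the slides do not say how the scores are obtained. The function name `translate_free_query`, the threshold value, the scores, and the dictionary entry are all hypothetical.

```python
def translate_free_query(query_terms, resources, general_dict, threshold=0.5):
    """Greedy query translation following steps (1)-(5)."""
    remaining = set(query_terms)
    # Steps 1-2: keep candidates above the threshold, merge, sort by score.
    candidates = [c for res in resources for c in res if c[2] > threshold]
    candidates.sort(key=lambda c: c[2], reverse=True)
    chinese_query = []
    # Steps 3-4: walk down the sorted list until the query is used up.
    for chinese, english_terms, _score in candidates:
        if not remaining:
            break
        common = remaining & set(english_terms)
        if common:                        # description overlaps the query
            chinese_query.append(chinese)
            remaining -= common           # delete the matched English terms
    # Step 5: look up whatever is left in the general dictionary.
    for term in query_terms:
        if term in remaining and term in general_dict:
            chinese_query.append(general_dict[term])
    return chinese_query

# Hypothetical resources with made-up scores.
name_table = [("范寬", ["fan", "kuan"], 0.9)]
title_table = [("谿山行旅圖", ["travelers", "among", "mountains", "and", "streams"], 0.8),
               ("山", ["mountains"], 0.3)]   # below threshold, never considered
general_dict = {"painting": "畫"}

query = ["fan", "kuan", "mountains", "streams", "painting"]
result = translate_free_query(query, [name_table, title_table], general_dict)
print(result)
```

Deleting matched terms as soon as a candidate is selected is what keeps a lower-scored resource from translating the same English words a second time.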