Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval
Johns Hopkins University Summer Workshop 2000
Presented at the ANLP-NAACL 2000 Embedded Machine Translation Systems Workshop
The MEI Team

MEI Team
• Senior Members
  – Helen Meng, Chinese University of Hong Kong
  – Erika Grams, Advanced Analytic Tools
  – Sanjeev Khudanpur, Johns Hopkins University
  – Gina-Anne Levow, University of Maryland
  – Douglas Oard, University of Maryland
  – Patrick Schone, US Department of Defense
  – Hsin-Min Wang, Academia Sinica, Taiwan
• Students
  – Berlin Chen, National Taiwan University
  – Wai-Kit Lo, Chinese University of Hong Kong
  – Karen Tang, Princeton University
  – Jianqiang Wang, University of Maryland

Outline
• Audio indexing
• MEI Project overview
• Research challenges
• System architecture
• Collaboration opportunities

Motivation
• Speech retrieval applications are emerging
  – e.g., http://speechbot.research.compaq.com
• Internet-accessible Radio and Television Stations
  – 529 English
  – 1367 Other Languages
  – Source: www.real.com, Feb 2000

The Big Picture
[Diagram labels, in extraction order: MEI; Translingual Audio Search; Translingual Audio Browsing; Select; English Query; Speech to Speech Translation; Examine; English Audio]

Related Work
• TREC Spoken Document Retrieval
  – Close coupling of recognition and retrieval
• TREC Cross-Language Retrieval
  – Close coupling of translation and retrieval
• TDT-3 Topic Tracking
  – Coupling recognition, translation and retrieval
    • Using speech recognition transcripts

The MEI Project
• Closely couple recognition and translation
  – For the purpose of retrieval
• Using English examples, find Mandarin audio
[Diagram labels, in extraction order: English Example; Newswire Stories; Mandarin Audio Collection; Query by Example]

Research Challenges
• Multi-scale audio indexing
  – Multiple feature sets capture more information
• Multi-scale translation
  – Lexicon and pronunciation are complementary
• Multi-scale retrieval
  – Combination of evidence can add robustness

Multi-scale Mandarin Audio Indexing
[Figure labels, in extraction order. Scales: Preme/Toneme; Preme/Core Final; Initial/Final. Units: /j/, /i/, /ji/, /j/, /ng/, /a/, /ang/, /iang/]

Multi-scale Translation
• Word-scale
  – Dictionary-based [Levow & Oard 00]
  – Parallel corpora [Nie 99]
  – Comparable corpora [Fung 98]
• Subword-scale [Knight & Graehl 97]
  – Cross-language phonetic mapping
  – /bei2 ai4 er3 lan2/
    • Kosovo (/ke1-sou3-wo4/, /ke1-sou3-fo2/, /ke1-sou3-fu1/, /ke1-sou3-fu2/)

Cross-Language Phonetic Mapping
• Syllabify English spelling
  – e.g. Jiang Zemin, Shandong Province
• Map English pronunciation to Mandarin
  – Convert phonemes to pinyin
    • e.g. /k ow s ax v ow/ to /ke1-suo3-wo4/
  – Plan to investigate alternative techniques
    • Rule-based
    • Statistical mapping

Multi-scale Retrieval
• Word-scale exploits lexical knowledge
  – Enhances precision
• Subwords can achieve complete coverage
  – Enhances recall
• Combination of evidence may be best
  – If a good merging strategy can be found

Multi-scale Retrieval Techniques
• Subword-scale
  – Syllable lattice matching [Chen, Wang & Lee 00]
  – Overlapping syllable n-grams [Meng et al. 99]
  – Syllable confusion matrix [Meng et al. 99]
• Word-scale
  – Structured queries [Pirkola 98]
  – Structured translation [Sperer & Oard 00]

Merging Strategies
• Loose coupling
  – Separate retrieval runs
  – Merge ranked lists [Voorhees 95]
• Tight coupling [Ng 00]
  – Unified indexing of words and subwords
  – Single ranked list

Robust Retrieval
• Multiple causes
  – Speech recognition errors
  – Translation ambiguity
  – Transliteration ambiguity
• Possible solutions
  – Weighted n-best indexing [Levow & Oard 00]
  – Syllable lattice indexing [Chen, Wang & Lee 00]
  – Syllable confusion expansion [Meng et al. 99]
  – Structured queries [Pirkola 98]
  – Document expansion [Levow & Oard 00]

System Architecture Overview
[Diagram labels, in extraction order: Mandarin Documents; Words; Corpus Statistics; Known Terms; English Example; Relevance Judgments; Translation Lexicon; Word Translation; Phonetic Transcription; Syllable n-grams; Retrieval System; Syllable n-gram Generation; Eval Code; Average Precision]

The TDT Collections
• Four stories per topic in each language
  – Each reporting on some aspect of one event
[Figure labels, in extraction order]
  Development Test (TDT-2): Mar 98; Jan 98; Jun 98; 41 Hours VOA Mandarin Audio
  Evaluation (TDT-3): Oct 98; Dec 98; 121 Hours Voice of America (VOA) Mandarin Audio
  English text: APW+NYT English; Associated Press (APW) and New York Times (NYT) English Newswire
  Other labels: 20 Topics; 59 Topics; Story Boundaries Known Condition

MEI Project Schedule
[Timeline labels, in extraction order: Dec; Feb; Six Weeks at Hopkins; Apr; Jun; Aug]

Things We Need
• Ideas
  – To sharpen our focus
• Connections
  – To build a community of interest
• Resources
  – To build on what others have done

For More Information
• MEI Project
  – http://www.glue.umd.edu/~meiweb
• Translingual Retrieval
  – http://www.clis.umd.edu/dlrg/clir
• Speech Retrieval
  – http://www.clis.umd.edu/dlrg/speech
• Hopkins Summer Workshop Series
  – http://www.clsp.jhu.edu/workshops

Detailed Query Processing (1)
[Diagram labels, in extraction order: List of translatable words and phrases; White-space separated text with named entity tags; Stopping; Stemming; Phrase Extraction; Terms that have Mandarin translations; Named entities; Terms with no Mandarin translations]

Detailed Query Processing (2)
[Diagram labels, in extraction order: Terms that have Mandarin translations; Named entity parsing rules for transliteration; Named entities; phonetic expansion; Bag of Mandarin terms; English pronunciation lexicon; eh n t ih t iy t eh k s t; English-Mandarin translation lexicon; Northern Ireland; Named entity parsing; Terms with no Mandarin translations; term translation; (filler text standing in for term lists); term translation; phonetic expansion; Bag of English phone sequences]

Detailed Query Processing (3)
[Diagram labels, in extraction order: Bag of Mandarin terms; Retain this term? (branches: no, yes); To trash (or downweight); ASR Insertion prone words; ASR substitution/deletion prone words; Syllabic expansion; Bag of English phone sequences; Mandarin syllabification rules; English phone strings to Mandarin syllables; Mandarin pronunciation lexicon; (filler text standing in for term lists); Smaller bag of Mandarin terms; s1 s2 s2 s5 s1 s2 s3; Two bags of Mandarin syllable sequences; sa sb sc sd sc se]

Detailed Query Processing (4)
[Diagram labels, in extraction order: Bag of high-confidence Mandarin terms; Syllable n-gram generation; syllabic expansion; Mandarin pronunciation lexicon; Mandarin syllable sequences from likely recognition errors; Mandarin syllable sequences from unknown words; Syllable n-gram generation; Syllable n-gram generation; s0 s1 s1 s2 s2 s3 s3 s4 s3 s5 s5 s6 s5 s7 s7 s7; s1 s2 s2 s3 s2 s5 s5 s1; sa sb sb sc sc sd sc se; (filler text standing in for term lists); Bag of Mandarin lexical terms; Three Bags of Mandarin syllable n-grams from different sources]
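
The syllable n-gram generation step named in the query-processing slides can be illustrated with a minimal Python sketch. It assumes a syllable sequence is simply a list of pinyin strings and that a "bag" is a multiset of overlapping n-grams; the function names `syllable_ngrams` and `bag_of_ngrams` are hypothetical and are not taken from the MEI system.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def syllable_ngrams(syllables: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return every overlapping n-gram of a syllable sequence, in order.

    A sequence shorter than n yields an empty list.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def bag_of_ngrams(sequences: Iterable[List[str]], n: int = 2) -> Counter:
    """Pool the n-grams of several syllable sequences into one bag (multiset)."""
    bag: Counter = Counter()
    for seq in sequences:
        bag.update(syllable_ngrams(seq, n))
    return bag


if __name__ == "__main__":
    # Pinyin syllables from the /ke1-suo3-wo4/ example on the
    # Cross-Language Phonetic Mapping slide.
    print(syllable_ngrams(["ke1", "suo3", "wo4"]))

    # Two hypothetical alternative sequences that share a prefix
    # (assumed data, chosen only to show how n-grams pool into a bag).
    print(bag_of_ngrams([["sa", "sb", "sc", "sd"], ["sa", "sb", "sc", "se"]]))
```

Keeping counts in the bag, rather than a plain set, lets an n-gram supported by several alternative sequences carry more weight at retrieval time; whether MEI weighted n-grams this way is not stated in the slides.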
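
The System Architecture Overview lists "Eval Code" producing "Average Precision" from the retrieval output and the relevance judgments. As a hedged illustration, the sketch below computes standard uninterpolated average precision for one query. That MEI's evaluation code used exactly this variant is an assumption, and the function name and document IDs are hypothetical.

```python
from typing import Hashable, Iterable, Sequence


def average_precision(ranked: Sequence[Hashable],
                      relevant: Iterable[Hashable]) -> float:
    """Uninterpolated average precision for a single query.

    ranked   -- document IDs in retrieval order, best first (assumed unique)
    relevant -- IDs judged relevant for the query
    Relevant documents that are never retrieved contribute zero precision.
    """
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / len(relevant_set)


if __name__ == "__main__":
    # Hypothetical ranked list and judgments, for illustration only.
    ranking = ["d3", "d1", "d7", "d2"]
    judged_relevant = {"d1", "d2", "d9"}
    print(average_precision(ranking, judged_relevant))
```

Averaging this value over all topics gives mean average precision, a common single-number summary for comparing retrieval configurations.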