Mandarin-English Information (MEI)
Investigating Translingual Speech Retrieval
Johns Hopkins University Summer Workshop 2000
Presented at the ANLP-NAACL 2000 Embedded Machine Translation Systems Workshop
The MEI Team
• Senior Members
– Helen Meng, Chinese University of Hong Kong
– Erika Grams, Advanced Analytic Tools
– Sanjeev Khudanpur, Johns Hopkins University
– Gina-Anne Levow, University of Maryland
– Douglas Oard, University of Maryland
– Patrick Schone, US Department of Defense
– Hsin-Min Wang, Academia Sinica, Taiwan
• Students
– Berlin Chen, National Taiwan University
– Wai-Kit Lo, Chinese University of Hong Kong
– Karen Tang, Princeton University
– Jianqiang Wang, University of Maryland
Outline
• Audio indexing
• MEI Project overview
• Research challenges
• System architecture
• Collaboration opportunities
Motivation
• Speech retrieval applications are emerging
– e.g., http://speechbot.research.compaq.com
• Internet-accessible radio and television stations
– 529 English
– 1367 other languages
– Source: www.real.com, Feb 2000
The Big Picture
[Diagram labels: MEI; Translingual Audio Search; Translingual Audio Browsing; Select; English Query; Speech to Speech Translation; Examine; English Audio]
Related Work
• TREC Spoken Document Retrieval
– Close coupling of recognition and retrieval
• TREC Cross-Language Retrieval
– Close coupling of translation and retrieval
• TDT-3 Topic Tracking
– Coupling recognition, translation and retrieval
• Using speech recognition transcripts
The MEI Project
• Closely couple recognition and translation
– For the purpose of retrieval
• Using English examples, find Mandarin audio
[Diagram labels: English Example Newswire Stories; Mandarin Audio Collection; Query by Example]
Research Challenges
• Multi-scale audio indexing
– Multiple feature sets capture more information
• Multi-scale translation
– Lexicon and pronunciation are complementary
• Multi-scale retrieval
– Combination of evidence can add robustness
Multi-scale Mandarin Audio Indexing
[Diagram: three sub-syllable unit inventories (Preme/Toneme, Preme/Core Final, Initial/Final), with example unit labels /j/, /i/, /ji/, /ng/, /a/, /ang/, /iang/]
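As a rough illustration of two of the scales named on this slide, the sketch below splits a toneless pinyin syllable into Initial/Final and into Preme/Core Final. The function names and the medial-glide rule are assumptions made for this example, not the project's actual unit definitions; the Preme/Toneme scale is left out because the slide does not say how tonemes are formed.

```python
# Pinyin initials (two-letter ones first so "zh" wins over "z").
# "y" and "w" are treated as initials here purely for convenience (assumption).
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
VOWELS = set("aeiouv")  # "v" stands in for u-umlaut

def initial_final(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllable such as "ang"

def preme_core_final(syllable):
    """Split into (preme, core final); the preme absorbs a medial glide."""
    ini, fin = initial_final(syllable)
    # Assumed rule: a leading i/u/v is a medial only if another vowel follows.
    if len(fin) > 1 and fin[0] in "iuv" and fin[1] in VOWELS:
        return ini + fin[0], fin[1:]
    return ini, fin

print(initial_final("jiang"))     # ('j', 'iang')
print(preme_core_final("jiang"))  # ('ji', 'ang')
```

Indexing the same audio with more than one of these unit sets is what "multi-scale" means here; the splits above only show how the units differ for a single syllable.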
Multi-scale Translation
• Word-scale
– Dictionary-based [Levow & Oard 00]
– Parallel corpora [Nie 99]
– Comparable corpora [Fung 98]
• Subword-scale [Knight & Graehl 97]
– Cross-language phonetic mapping
– /bei2 ai4 er3 lan2/
• Kosovo (/ke1-suo3-wo4/, /ke1-suo3-fo2/, /ke1-suo3-fu1/, /ke1-suo3-fu2/)
Cross-Language Phonetic Mapping
• Syllabify English spelling
– e.g. Jiang Zemin, Shandong Province
• Map English pronunciation to Mandarin
– Convert phonemes to pinyin
• e.g. /k ow s ax v ow/ to /ke1-suo3-wo4/
– Plan to investigate alternative techniques
• Rule-based
• Statistical mapping
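A minimal sketch of the rule-based option, assuming a strict consonant-vowel syllable structure and a hand-written lookup table. The table, the vowel set, and the function names are hypothetical and cover only the Kosovo example from this slide; a real mapping would need many more rules, and a statistical mapping would learn them from data.

```python
# Toy rule table (hypothetical): English (onset, vowel) pairs -> toned pinyin.
CV_TO_PINYIN = {
    ("k", "ow"): "ke1",
    ("s", "ax"): "suo3",
    ("v", "ow"): "wo4",
}
# Small illustrative vowel set in the phone notation used on the slide.
ENGLISH_VOWELS = {"ow", "ax", "iy", "ih", "eh", "aa", "uw"}

def syllabify(phones):
    """Group a phone list into (onset, vowel) pairs; assumes CV syllables."""
    pairs, onset = [], ""
    for p in phones:
        if p in ENGLISH_VOWELS:
            pairs.append((onset, p))
            onset = ""
        else:
            onset = p  # keeps only the last consonant before a vowel
    return pairs       # trailing consonants are dropped in this sketch

def to_pinyin(phone_string):
    """Map a space-separated English phone string to hyphenated pinyin."""
    pairs = syllabify(phone_string.split())
    # Unmapped pairs stay visible as "<?>" rather than being guessed.
    return "-".join(CV_TO_PINYIN.get(pair, "<?>") for pair in pairs)

print(to_pinyin("k ow s ax v ow"))  # ke1-suo3-wo4
```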
Multi-scale Retrieval
• Word-scale exploits lexical knowledge
– Enhances precision
• Subwords can achieve complete coverage
– Enhances recall
• Combination of evidence may be best
– If a good merging strategy can be found
Multi-scale Retrieval Techniques
• Subword-scale
– Syllable lattice matching [Chen, Wang & Lee 00]
– Overlapping syllable n-grams [Meng et al. 99]
– Syllable confusion matrix [Meng et al. 99]
• Word-scale
– Structured queries [Pirkola 98]
– Structured translation [Sperer & Oard 00]
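To make the overlapping syllable n-gram idea concrete, here is a small sketch that generates n-grams from a syllable sequence and counts how many a query shares with a document. The overlap count is an assumed stand-in for a real retrieval score, and the function names are made up for this example.

```python
from collections import Counter

def syllable_ngrams(syllables, n=2):
    """Return all overlapping n-grams over a syllable sequence."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]

def ngram_overlap(query_syllables, doc_syllables, n=2):
    """Count n-grams shared by query and document (multiset intersection)."""
    q = Counter(syllable_ngrams(query_syllables, n))
    d = Counter(syllable_ngrams(doc_syllables, n))
    return sum((q & d).values())

query = ["ke1", "suo3", "wo4"]
document = ["zai4", "ke1", "suo3", "wo4"]
print(syllable_ngrams(query))          # [('ke1', 'suo3'), ('suo3', 'wo4')]
print(ngram_overlap(query, document))  # 2
```

Because the n-grams overlap, a single misrecognized syllable damages at most n of them, which is one reason subword indexing can tolerate recognition errors.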
Merging Strategies
• Loose coupling
– Separate retrieval runs
– Merge ranked lists [Voorhees 95]
• Tight coupling [Ng 00]
– Unified indexing of words and subwords
– Single ranked list
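The loose-coupling option can be sketched as two separate runs whose scores are rescaled and combined. The min-max normalization and weighted sum used below are a common fusion heuristic chosen for illustration; they are not claimed to be the method of [Voorhees 95], and the weight and document ids are made up.

```python
def min_max(scores):
    """Rescale a non-empty {doc_id: score} map to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def merge_runs(word_run, subword_run, word_weight=0.5):
    """Merge a word-scale and a subword-scale run into one ranked list."""
    w, s = min_max(word_run), min_max(subword_run)
    merged = {}
    for doc in set(w) | set(s):
        # A document missing from one run contributes 0 for that run.
        merged[doc] = (word_weight * w.get(doc, 0.0)
                       + (1 - word_weight) * s.get(doc, 0.0))
    # Highest combined score first; ties broken by document id.
    return sorted(merged, key=lambda d: (-merged[d], d))

word_run = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
subword_run = {"d2": 0.9, "d4": 0.6, "d3": 0.1}
print(merge_runs(word_run, subword_run))  # ['d2', 'd1', 'd4', 'd3']
```

Tight coupling would instead put words and subwords in one index, so no score normalization across runs is needed.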
Robust Retrieval
• Multiple causes
– Speech recognition errors
– Translation ambiguity
– Transliteration ambiguity
• Possible solutions
– Weighted n-best indexing [Levow & Oard 00]
– Syllable lattice indexing [Chen, Wang & Lee 00]
– Syllable confusion expansion [Meng et al. 99]
– Structured queries [Pirkola 98]
– Document expansion [Levow & Oard 00]
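One way to read the structured-query idea is that all translations of a single English term are treated as one synonym set, so an ambiguous term with many translations does not outweigh the rest of the query. The sketch below follows that reading with a simplified tf-idf weight; the weighting formula, function names, and toy corpus are assumptions for illustration only.

```python
import math

def synset_tf(doc_tokens, synset):
    """Term frequency of a synonym set: occurrences of any member."""
    return sum(1 for tok in doc_tokens if tok in synset)

def synset_df(corpus, synset):
    """Document frequency of a synonym set: documents with any member."""
    return sum(1 for doc in corpus if any(tok in synset for tok in doc))

def score(doc_tokens, corpus, query_synsets):
    """Sum tf * idf over synonym sets, one set per English query term."""
    total = 0.0
    for synset in query_synsets:
        df = synset_df(corpus, synset)
        if df:  # skip sets that match nothing in the collection
            idf = math.log(1 + len(corpus) / df)
            total += synset_tf(doc_tokens, synset) * idf
    return total

# Toy corpus of tokenized documents, written in pinyin for readability.
corpus = [["yin2hang2", "li4lv4"], ["he2an4"], ["tian1qi4"]]
bank = {"yin2hang2", "he2an4"}  # two candidate translations of "bank"
print(synset_df(corpus, bank))  # 2
print(score(corpus[0], corpus, [bank]))
```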
System Architecture Overview
[Diagram labels: Mandarin Documents; Words; Corpus Statistics; Known Terms; English Example; Relevance Judgments; Translation Lexicon; Word Translation; Phonetic Transcription; Syllable n-grams; Retrieval System; Syllable n-gram Generation; Eval Code; Average Precision]
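The architecture ends with evaluation code that compares a ranked list against relevance judgments and reports average precision. A minimal sketch of uninterpolated average precision for a single query follows; the function name and the sample document ids are illustrative, and this is not the project's evaluation code.

```python
def average_precision(ranked_docs, relevant):
    """Uninterpolated average precision of one ranked list.

    ranked_docs: document ids in retrieval order.
    relevant: set of ids judged relevant for the query.
    """
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant doc
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision(["d2", "d1", "d4", "d3"], {"d2", "d4"}))
```

Averaging this value over all topics gives mean average precision, a usual single-number summary for comparing retrieval configurations.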
The TDT Collections
• Four stories per topic in each language
– Each reporting on some aspect of one event
[Figure: timeline of the two collections; labels as they appear]
• Development Test (TDT-2): Mar 98, Jan 98, Jun 98; 41 hours of VOA Mandarin audio
• Evaluation (TDT-3): Oct 98 to Dec 98; 121 hours of Voice of America (VOA) Mandarin audio
• English newswire: Associated Press (APW) and New York Times (NYT)
• Other labels: 20 Topics; 59 Topics; Story Boundaries Known Condition
MEI Project Schedule
[Timeline labels: Dec, Feb, Apr, Jun, Aug; Six Weeks at Hopkins]
Things We Need
• Ideas
– To sharpen our focus
• Connections
– To build a community of interest
• Resources
– To build on what others have done
For More Information
• MEI Project
– http://www.glue.umd.edu/~meiweb
• Translingual Retrieval
– http://www.clis.umd.edu/dlrg/clir
• Speech Retrieval
– http://www.clis.umd.edu/dlrg/speech
• Hopkins Summer Workshop Series
– http://www.clsp.jhu.edu/workshops
Detailed Query Processing (1)
[Flowchart; labels as they appear]
• White-space separated text with named entity tags
• Stopping, stemming, phrase extraction
• List of translatable words and phrases
• Terms that have Mandarin translations
• Named entities
• Terms with no Mandarin translations
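The first query-processing stage can be sketched as stopping, stemming, and sorting terms into the three groups named on the slide. The stop list, the suffix-stripping stemmer, the tiny lexicon, and the function names are toy assumptions; phrase extraction is omitted, and named-entity tags are assumed to arrive with the input.

```python
STOPWORDS = {"the", "of", "in", "a", "and"}  # toy stop list (assumption)
# Toy English-Mandarin lexicon (assumption), with translations in pinyin.
LEXICON = {"peace": ["he2ping2"], "talk": ["hui4tan2"]}

def stem(word):
    """Very rough suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def partition_terms(tagged_tokens):
    """Split (token, is_named_entity) pairs into the three output groups."""
    translatable, named_entities, untranslatable = [], [], []
    for token, is_ne in tagged_tokens:
        if is_ne:
            named_entities.append(token)  # kept as-is for transliteration
            continue
        word = token.lower()
        if word in STOPWORDS:
            continue
        word = stem(word)
        (translatable if word in LEXICON else untranslatable).append(word)
    return translatable, named_entities, untranslatable

tokens = [("Peace", False), ("talks", False), ("in", False),
          ("Kosovo", True), ("stalled", False)]
print(partition_terms(tokens))
# (['peace', 'talk'], ['Kosovo'], ['stalled'])
```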
Detailed Query Processing (2)
[Flowchart; labels as they appear]
• Terms that have Mandarin translations
• Named entities
• Terms with no Mandarin translations
• Named entity parsing rules for transliteration
• Named entity parsing (example: Northern Ireland)
• Term translation
• Phonetic expansion
• English-Mandarin translation lexicon
• English pronunciation lexicon (e.g., /eh n t ih t iy/, /t eh k s t/)
• Bag of Mandarin terms
• Bag of English phone sequences
Detailed Query Processing (3)
[Flowchart; labels as they appear]
• Bag of Mandarin terms
• Retain this term? (yes / no; to trash, or downweight)
• ASR insertion-prone words
• ASR substitution/deletion-prone words
• Smaller bag of Mandarin terms
• Bag of English phone sequences
• Syllabic expansion
• Mandarin syllabification rules: English phone strings to Mandarin syllables
• Mandarin pronunciation lexicon
• Two bags of Mandarin syllable sequences (e.g., s1 s2; s2 s5 s1; s2 s3 and sa sb sc sd; sc se)
Detailed Query Processing (4)
[Flowchart; labels as they appear]
• Bag of high-confidence Mandarin terms
• Syllabic expansion
• Mandarin pronunciation lexicon
• Mandarin syllable sequences from likely recognition errors
• Mandarin syllable sequences from unknown words
• Syllable n-gram generation (three instances)
• Bag of Mandarin lexical terms
• Three bags of Mandarin syllable n-grams from different sources
– s0 s1, s1 s2, s2 s3, s3 s4, s3 s5, s5 s6, s5 s7, s7 s7
– s1 s2, s2 s3, s2 s5, s5 s1
– sa sb, sb sc, sc sd, sc se