Tri-lingual EDL Planning
WORRY, BE HAPPY!
Heng Ji (RPI), Hoa Trang Dang (NIST)

Motivations: Cross-lingual KBP

Motivations: Cross-lingual Information Fusion
- Who is Jim Parsons? How is he doing lately?

Motivations: A Smart Cross-lingual Kindle
- (Example entities: Xi Jinping, Sunnylands, California, China, US, South Sea, Diaoyu Islands)
- Navigating unfamiliar languages/domains
- Education purposes

Current Status of EDL/EL/Wikification
- English EDL attracted 20 teams; end-to-end EDL scores are around 70%
- EL: mono-lingual linking techniques are mature (about 90% accuracy)
- But there are few ACL papers on cross-lingual EDL/EL/Wikification
- Goals: extend from mono-lingual to cross-lingual; rapid KB construction for a foreign language

Tri-lingual EDL Task Definition
- Input:
  - Source collection: English, Chinese, Spanish
  - KB: English only (Chinese and Spanish KBs are disallowed)
  - Discourage using inter-lingual Wikipedia links (goal: rapid KB construction for low-density languages)
- Output:
  - Entity clusters presented in English, some with links to the English KB
  - Some clusters come from a single language; some span multiple languages
  - NIL mention translations may need to be normalized for the ground truth
- A typical system should extract entity mentions from all three languages, link them to the English KB, cluster the NIL mentions, and translate them (a minimal sketch follows below)
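As an illustration only, here is a minimal Python sketch of that pipeline. Every helper (extract_mentions, link_to_english_kb, translate_name) is a hypothetical placeholder, not any team's released software, and cross-lingual NIL clustering is reduced here to exact match on translated names.

```python
# Hedged sketch of a tri-lingual EDL pipeline; all helpers are hypothetical stubs.
from collections import defaultdict

def extract_mentions(text, lang):
    """Placeholder: a real system would run a name tagger for `lang`."""
    return []  # list of mention strings

def link_to_english_kb(mention, lang):
    """Placeholder: returns an English KB id, or None for NIL."""
    return None

def translate_name(mention, lang):
    """Placeholder: English rendering of a non-English name."""
    return mention  # identity stub

def run_trilingual_edl(docs):
    """docs: iterable of (text, lang) pairs, lang in {"en", "zh", "es"}.
    Returns cross-lingual entity clusters keyed by KB id or a NIL key."""
    clusters = defaultdict(list)
    for text, lang in docs:
        for mention in extract_mentions(text, lang):
            kb_id = link_to_english_kb(mention, lang)
            if kb_id is None:
                # NIL: translate to English, then cluster; exact string match
                # stands in for real cross-lingual NIL clustering.
                kb_id = "NIL:" + translate_name(mention, lang)
            clusters[kb_id].append((mention, lang))
    return clusters
```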
Tri-lingual Diagnostic EL Task Definition
- Perfect mentions (queries) are given
- Queries: English, Chinese, Spanish
- Some queries will come from a single language only
- Some queries will appear in multiple languages, forming cross-lingual entity clusters

Source Collection
- Some KBA web streaming data in English, Chinese, Spanish
- Some social media data with code-switching
- Some formal comparable newswire
- Some discussion forum posts
- Includes the KBP2014 EDL corpora, at a larger scale than KBP2014 EDL
- Shares some documents with the Cold-Start KBP task
- Maybe consider news only for 2015

KB: Freebase
- 2.6 billion triples (vs. DBpedia's 583 million triples)
- Potential problem (and opportunity): some entries have no corresponding Wikipedia pages, so systems have no Wikipedia articles to analyze (similar to the EL optional task before 2014)
- May trigger new research for the case where the KB includes no unstructured text

Resources: English
- Google, LCC, IBM, and RPI will run English EDL on the entire source collection
- Each system generates the top 10 candidates per mention, then the systems vote; oracle linking accuracy should be above 97%
- Give these to LDC as starting points to speed up human annotation/assessment
- This is the pipeline RPI+ISI used for AMR EDL annotation (ISI has an annotation interface to correct the top 10 RPI system-generated candidates + Google search + …)
- RPI can share English entity embeddings

Resources: Chinese
- Software:
  - Stanford basic Chinese NLP (name tagging, parsing)
  - CAS basic Chinese NLP (POS tagging, name tagging, chunking)
  - RPI Chinese IE (name tagging, relation and event extraction; coreference/nominal handling is weak)
- Resources:
  - RPI has 2 million manually cleaned Chinese-English name translation pairs to share, plus Chinese entity embeddings
  - LDC has Chinese-English name dictionaries with frequency information
  - LDC is developing more training data for Chinese/Spanish SF
- Automatic annotation:
  - RPI can provide Chinese name tagging and translation, and event trigger/argument extraction
  - BBN/IBM will run Chinese IE on the source collection

Resources: Spanish
- Software:
  - Dependency parser: MaltParser
  - Stanford Spanish CoreNLP (name tagger, …) coming in the next 6 months
  - Need more help from the community
- Automatic annotation:
  - IBM will run Spanish ACE entity extraction (names, coreference) and parsing on the source collection

Timeline
- Release training data in May
- Pilot study in May 2015 (you can submit manual runs!)
- Evaluation: September/October 2015

Teams with Expertise/Interest (only workshop attendees asked so far)
- English/Spanish/Chinese: Yes: IBM, HITS, NYU, RPI. Maybe: JHU, LCC, BBN …
- English/Chinese: PKU, Tsinghua, and many more Chinese teams …
- English/Spanish: CSFG. Maybe: UIUC …
- Speed-dating between Chinese and Spanish teams

Another Ambitious Proposal: Cross-lingual Slot Filling
- Chinese-to-English slot filling:
  - Annotation guideline available
  - BLENDER pilot system (Snover et al., 2011)
  - Off-cycle pilot in DEFT (Jan 2015): RPI, Univ. of Washington, Univ. of Wisconsin, CMU
- Spanish-to-English slot filling:
  - Evaluation proposed in KBP2013; guideline and annotation available
- Tri-lingual slot filling

Other Issues
- Mention definition:
  - Extraction for linking
  - Nested mentions
  - Posters (discussion forum post authors)
- Scoring:
  - Is the current scoring reasonable?
  - If we run EDL on 50K documents and only some entities/documents are manually annotated, how do we evaluate clustering performance?
- Add new entity types in 2015: Location and Facility?
- Add non-name concepts (e.g., nominal mentions)? E.g., link "wife" in "Obama's wife" to "Michelle" in the KB

Other Issues (continued)
- Evidence & confidence:
  - Annotation to provide evidence for NIL
  - System confidence/justification
- Correct annotation errors:
  - Need community effort to report errors / share corrections
  - Improve/extend the annotation guidelines and check IAA
- Shift some annotation cost from labeling new data to knowledge resource construction?
  - The current research bottlenecks in coreference and slot filling lie in knowledge acquisition rather than in more labeled data
  - e.g., semantic distance between any two nominals for coreference (a toy sketch follows below)
  - e.g., large-scale clean paraphrases for slot filling
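To make the semantic-distance bullet concrete: one common approximation (not necessarily what the presenters have in mind) is cosine distance over word embeddings. The sketch below uses toy 3-dimensional vectors for illustration; a real resource would use embeddings trained on large corpora.

```python
# Hedged sketch: semantic distance between two nominals as embedding
# cosine distance. The vectors here are toy values, not real embeddings.
import math

EMBEDDINGS = {
    "wife":   [0.9, 0.1, 0.3],
    "spouse": [0.8, 0.2, 0.3],
    "lawyer": [0.1, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_distance(n1, n2):
    """Distance in [0, 2]; a small value suggests the nominals are compatible."""
    return 1.0 - cosine(EMBEDDINGS[n1], EMBEDDINGS[n2])

print(semantic_distance("wife", "spouse"))  # small: coreference plausible
print(semantic_distance("wife", "lawyer"))  # larger: likely incompatible
```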