Tri-lingual EDL Planning
WORRY, BE HAPPY!
Heng Ji (RPI), Hoa Trang Dang (NIST)

Motivations: Cross-lingual KBP

Motivations: Cross-lingual Information Fusion
- Who is Jim Parsons? How is he doing lately?

Motivations: A Smart Cross-lingual Kindle
- (Example entities: Xi Jinping, Sunnylands, California, China, US, South Sea, Diaoyu Islands)
- Navigating unfamiliar languages/domains
- Education purposes

Current Status of EDL/EL/Wikification
- English EDL attracted 20 teams; end-to-end EDL scores are around 70%
- EL: mono-lingual linking techniques are mature (about 90% accuracy)
- But there are few ACL papers on cross-lingual EDL/EL/Wikification
- Goals: extend from mono-lingual to cross-lingual; rapid KB construction for a foreign language

Tri-lingual EDL Task Definition
- Input:
  - Source collection: English, Chinese, Spanish
  - KB: English only (Chinese and Spanish KBs are disallowed)
  - Discourage using inter-lingual Wikipedia links (goal: rapid KB construction for low-density languages)
- Output:
  - Entity clusters presented in English, some with links to the English KB
  - Some clusters come from a single language; some span multiple languages
  - NIL mention translations may need to be normalized for the ground truth
- A typical system should extract entity mentions from all three languages, link them to the English KB, cluster the NIL mentions, and translate them (a minimal sketch follows below)
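As an illustration only, here is a minimal Python sketch of that pipeline. Every helper (extract_mentions, link_to_english_kb, translate_name) is a hypothetical placeholder, not any team's released software, and cross-lingual NIL clustering is reduced here to exact match on translated names.

```python
# Hedged sketch of a tri-lingual EDL pipeline; all helpers are hypothetical stubs.
from collections import defaultdict

def extract_mentions(text, lang):
    """Placeholder: a real system would run a name tagger for `lang`."""
    return []  # list of mention strings

def link_to_english_kb(mention, lang):
    """Placeholder: returns an English KB id, or None for NIL."""
    return None

def translate_name(mention, lang):
    """Placeholder: English rendering of a non-English name."""
    return mention  # identity stub

def run_trilingual_edl(docs):
    """docs: iterable of (text, lang) pairs, lang in {"en", "zh", "es"}.
    Returns cross-lingual entity clusters keyed by KB id or a NIL key."""
    clusters = defaultdict(list)
    for text, lang in docs:
        for mention in extract_mentions(text, lang):
            kb_id = link_to_english_kb(mention, lang)
            if kb_id is None:
                # NIL: translate to English, then cluster; exact string match
                # stands in for real cross-lingual NIL clustering.
                kb_id = "NIL:" + translate_name(mention, lang)
            clusters[kb_id].append((mention, lang))
    return clusters
```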
Tri-lingual Diagnostic EL Task Definition
- Perfect mentions (queries) are given
- Queries: English, Chinese, Spanish
- Some queries will come from a single language only
- Some queries will appear in multiple languages, forming cross-lingual entity clusters

Source Collection
- Some KBA web streaming data in English, Chinese, Spanish
- Some social media data with code-switching
- Some formal comparable newswire
- Some discussion forum posts
- Includes the KBP2014 EDL corpora, at a larger scale than KBP2014 EDL
- Shares some documents with the Cold-Start KBP task
- Maybe consider news only for 2015

KB: Freebase
- 2.6 billion triples (vs. DBpedia's 583 million triples)
- Potential problem (and opportunity): some entries have no corresponding Wikipedia pages, so systems have no Wikipedia articles to analyze (similar to the EL optional task before 2014)
- May trigger new research for the case where the KB includes no unstructured text

Resources: English
- Google, LCC, IBM, and RPI will run English EDL on the entire source collection
- Each system generates the top 10 candidates per mention, then the systems vote; oracle linking accuracy should be above 97%
- Give these to LDC as starting points to speed up human annotation/assessment
- This is the pipeline RPI+ISI used for AMR EDL annotation (ISI has an annotation interface to correct the top 10 RPI system-generated candidates + Google search + …)
- RPI can share English entity embeddings

Resources: Chinese
- Software:
  - Stanford basic Chinese NLP (name tagging, parsing)
  - CAS basic Chinese NLP (POS tagging, name tagging, chunking)
  - RPI Chinese IE (name tagging, relation and event extraction; coreference/nominal handling is weak)
- Resources:
  - RPI has 2 million manually cleaned Chinese-English name translation pairs to share, plus Chinese entity embeddings
  - LDC has Chinese-English name dictionaries with frequency information
  - LDC is developing more training data for Chinese/Spanish SF
- Automatic annotation:
  - RPI can provide Chinese name tagging and translation, and event trigger/argument extraction
  - BBN/IBM will run Chinese IE on the source collection

Resources: Spanish
- Software:
  - Dependency parser: MaltParser
  - Stanford Spanish CoreNLP (name tagger, …) coming in the next 6 months
  - Need more help from the community
- Automatic annotation:
  - IBM will run Spanish ACE entity extraction (names, coreference) and parsing on the source collection

Timeline
- Release training data in May
- Pilot study in May 2015 (you can submit manual runs!)
- Evaluation: September/October 2015

Teams with Expertise/Interest (only workshop attendees asked so far)
- English/Spanish/Chinese: Yes: IBM, HITS, NYU, RPI. Maybe: JHU, LCC, BBN …
- English/Chinese: PKU, Tsinghua, and many more Chinese teams …
- English/Spanish: CSFG. Maybe: UIUC …
- Speed-dating between Chinese and Spanish teams

Another Ambitious Proposal: Cross-lingual Slot Filling
- Chinese-to-English slot filling:
  - Annotation guideline available
  - BLENDER pilot system (Snover et al., 2011)
  - Off-cycle pilot in DEFT (Jan 2015): RPI, Univ. of Washington, Univ. of Wisconsin, CMU
- Spanish-to-English slot filling:
  - Evaluation proposed in KBP2013; guideline and annotation available
- Tri-lingual slot filling

Other Issues
- Mention definition:
  - Extraction for linking
  - Nested mentions
  - Posters (discussion forum post authors)
- Scoring:
  - Is the current scoring reasonable?
  - If we run EDL on 50K documents and only some entities/documents are manually annotated, how do we evaluate clustering performance?
- Add new entity types in 2015: Location and Facility?
- Add non-name concepts (e.g., nominal mentions)? E.g., link "wife" in "Obama's wife" to "Michelle" in the KB

Other Issues (continued)
- Evidence & confidence:
  - Annotation to provide evidence for NIL
  - System confidence/justification
- Correct annotation errors:
  - Need community effort to report errors / share corrections
  - Improve/extend the annotation guidelines and check IAA
- Shift some annotation cost from labeling new data to knowledge resource construction?
  - The current research bottlenecks in coreference and slot filling lie in knowledge acquisition rather than in more labeled data
  - e.g., semantic distance between any two nominals for coreference (a toy sketch follows below)
  - e.g., large-scale clean paraphrases for slot filling
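To make the semantic-distance bullet concrete: one common approximation (not necessarily what the presenters have in mind) is cosine distance over word embeddings. The sketch below uses toy 3-dimensional vectors for illustration; a real resource would use embeddings trained on large corpora.

```python
# Hedged sketch: semantic distance between two nominals as embedding
# cosine distance. The vectors here are toy values, not real embeddings.
import math

EMBEDDINGS = {
    "wife":   [0.9, 0.1, 0.3],
    "spouse": [0.8, 0.2, 0.3],
    "lawyer": [0.1, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_distance(n1, n2):
    """Distance in [0, 2]; a small value suggests the nominals are compatible."""
    return 1.0 - cosine(EMBEDDINGS[n1], EMBEDDINGS[n2])

print(semantic_distance("wife", "spouse"))  # small: coreference plausible
print(semantic_distance("wife", "lawyer"))  # larger: likely incompatible
```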