Tri-lingual EDL Planning

WORRY, BE HAPPY!
Heng Ji (RPI)
Hoa Trang Dang (NIST)
Motivations: Cross-lingual KBP
Motivations: Cross-lingual Information Fusion
Who is Jim Parsons? How is he doing lately?
Motivations: A Smart Cross-lingual Kindle
Xi Jinping
Sunnylands
California
China
US
South Sea
Diaoyu Islands

Navigating Unfamiliar Languages/Domains

Education purposes
Current Status of EDL/EL/Wikification

English EDL attracted 20 teams

End-to-end EDL score 70%

EL: mature mono-lingual linking techniques 90% accuracy

But limited ACL papers on cross-lingual EDL/EL/Wikification

Goals

Extend from Mono-lingual to Cross-lingual

Rapid KB construction for a foreign language
Tri-lingual EDL Task Definition

Input:

Source Collection: English, Chinese, Spanish

KB: English only (Chinese KB and Spanish KB are disallowed)



Discourage using inter-lingual Wikipedia links → rapid KB construction for low-density languages
Output: Entity clusters presented in English, some with links to the English KB

Some clusters come from a single language; others span multiple languages

May need to normalize NIL mention translations for ground-truth
A typical system should extract entity mentions from all three
languages, link them to English KB, cluster and translate NIL
mentions
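The output described above can be pictured as a small data structure. This is only an illustrative sketch, not the official submission format: the class names, field names, language codes, and the KB id `KB:E0001` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Mention:
    doc_id: str
    start: int   # character offset of the mention start (assumed convention)
    end: int     # character offset of the mention end (assumed convention)
    text: str
    lang: str    # hypothetical codes: "eng", "cmn", "spa"


@dataclass
class EntityCluster:
    english_name: str              # clusters are presented in English
    kb_id: Optional[str] = None    # None marks a NIL cluster (no English KB entry)
    mentions: List[Mention] = field(default_factory=list)

    @property
    def is_nil(self) -> bool:
        return self.kb_id is None

    @property
    def languages(self) -> Set[str]:
        return {m.lang for m in self.mentions}

    @property
    def is_cross_lingual(self) -> bool:
        # True when the cluster contains mentions from more than one language
        return len(self.languages) > 1


# A linked cluster with mentions from two languages
linked = EntityCluster("Xi Jinping", kb_id="KB:E0001")
linked.mentions.append(Mention("doc_en_1", 0, 10, "Xi Jinping", "eng"))
linked.mentions.append(Mention("doc_zh_1", 5, 8, "习近平", "cmn"))

# A NIL cluster from a single language, with a translated English name
nil = EntityCluster("Juan Perez")
nil.mentions.append(Mention("doc_es_1", 12, 22, "Juan Pérez", "spa"))
```

Here `linked.is_cross_lingual` is true and `nil.is_nil` is true, matching the two kinds of clusters on the slide.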
Tri-lingual Diagnostic EL Task Definition

Perfect mentions (queries) are given

Query: English, Chinese, Spanish

Some queries will be from single language only

Some queries will exist in multiple languages to form cross-lingual
entity clusters
Source Collection

Some KBA web streaming data in English, Chinese, Spanish

Some social media data with code-switching

Some formal comparable newswire

Some discussion forum posts

Include KBP2014 EDL corpora

Larger scale than KBP2014 EDL

Share some documents with Cold-start KBP task

Maybe consider news only for 2015
KB: Freebase

2.6 billion triples (vs. 583 million triples in DBpedia)

Potential Problem (and Opportunity)

Some entries don’t have corresponding Wikipedia pages,
so systems don’t have Wikipedia articles to analyze
(similar to EL optional task before 2014)

May trigger some new research when KB doesn’t include
unstructured texts
Resources: English

Google, LCC, IBM, RPI will run English EDL on the entire source
collection

Each generates the top 10 candidates for each mention, then the systems vote

Oracle linking accuracy should be above 97%

Give these to LDC as starting points to speed up human
annotation/assessment

Same pipeline RPI+ISI used for AMR EDL annotation (ISI has an annotation interface to correct the top 10 RPI system-generated candidates + Google search + …)

RPI can share English entity embeddings
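The candidate pooling step above could look roughly like the following. The slides do not say how the vote is taken, so simple vote counting is an assumption here, and the function names and KB ids are hypothetical. "Oracle linking accuracy" is read as the fraction of mentions whose correct KB entry appears anywhere in the pooled candidates.

```python
from collections import Counter


def vote(candidate_lists, top_k=10):
    """Pool ranked candidate lists from several systems by counting votes.

    candidate_lists: one ranked list of KB ids per system, for one mention.
    Assumption: each system contributes one vote per distinct candidate
    in its top_k; the slides do not specify the voting scheme.
    """
    votes = Counter()
    for candidates in candidate_lists:
        for kb_id in set(candidates[:top_k]):
            votes[kb_id] += 1
    # Most votes first; ties broken by KB id so the result is deterministic
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kb_id for kb_id, _ in ranked]


def oracle_accuracy(pooled, gold):
    """Fraction of gold mentions whose correct KB id is among the pooled candidates."""
    if not gold:
        return 0.0
    hits = sum(1 for mention, kb_id in gold.items() if kb_id in pooled.get(mention, []))
    return hits / len(gold)


# Toy example with three systems and two mentions (all ids are made up)
pooled = {
    "m1": vote([["E1", "E2"], ["E2", "E3"], ["E2", "E1"]]),
    "m2": vote([["E4"], ["E5"], []]),
}
gold = {"m1": "E1", "m2": "E9"}
accuracy = oracle_accuracy(pooled, gold)
```

In this toy case `pooled["m1"]` is `["E2", "E1", "E3"]` and `accuracy` is 0.5, since the gold entry for `m2` was proposed by no system. Annotators would then only need to pick from, or correct, the pooled list.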
Resources: Chinese



Software

Stanford Basic Chinese NLP (name tagging, parsing)

CAS Basic Chinese NLP (POS tagging, name tagging, chunking)

RPI Chinese IE (name tagging, relation, event, not-great coref/nominal)
Resources

RPI has 2 million manually cleaned Chinese-English name
translation pairs to share and Chinese entity embeddings

LDC has Chinese-English name dict/dicts with frequency
information

LDC is developing more training data for Chinese/Spanish SF
Automatic Annotation

RPI can provide Chinese name tagging and translation, and event
trigger/argument extraction

BBN/IBM run Chinese IE on source collection
Resources: Spanish


Software

Dependency parser: Maltparser

Stanford Spanish CoreNLP (name tagger, …) coming in the next 6 months

Need more help from the community
Automatic Annotation

IBM will run Spanish ACE entity extraction (name, coref) and parsing on the source collection
Timeline

Release training data in May

A pilot study in May 2015


You can submit manual runs!
Evaluation: September/October 2015
Teams with Expertise/Interest
(only asked workshop attendees so far)




English/Spanish/Chinese

Yes: IBM, HITS, NYU, RPI

Maybe: JHU, LCC, BBN

…
English/Chinese:

PKU, Tsinghua, a lot more Chinese teams

…
English/Spanish

CSFG

Maybe: UIUC

…
Speed-dating between Chinese & Spanish teams
Another Ambitious Proposal: Cross-lingual Slot Filling



Chinese-to-English Slot Filling

Annotation guideline available

BLENDER Pilot system (Snover et al., 2011)

Off-cycle Pilot in DEFT (Jan 2015): RPI, Univ. of Washington, Univ. of
Wisconsin, CMU
Spanish-to-English Slot Filling

Evaluation proposed in KBP2013

Guideline, Annotation available
Tri-lingual Slot Filling
Other Issues

Mention Definition




Extraction for linking
Nested mentions
Posters
Scoring


Is the current scoring reasonable?
If we do EDL on 50K documents and only some of the entities/documents are manually annotated, how do we evaluate clustering performance?

Add new entity types in 2015: Location and Facility?

Add Non-name concepts (e.g., nominal mentions)?

Link “wife” in “Obama’s wife” to “Michelle” in KB
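On the partial-annotation question under Scoring, one possible option (not something the slides propose) is to score clustering only over the mentions that were manually annotated. The sketch below uses B-cubed purely as a familiar clustering metric; it is not necessarily the track's official scorer, and the function name and mention/cluster ids are hypothetical.

```python
def b_cubed_on_annotated(system, gold):
    """B-cubed precision, recall, and F1 restricted to annotated mentions.

    system, gold: dicts mapping mention id -> cluster id.
    Only mentions present in `gold` are scored; system mentions outside the
    annotated subset are ignored. An annotated mention the system missed is
    treated as its own singleton cluster (an assumption of this sketch).
    """
    mentions = list(gold)
    if not mentions:
        return 0.0, 0.0, 0.0

    def sys_id(m):
        return system.get(m, ("__missing__", m))

    p_sum = r_sum = 0.0
    for m in mentions:
        sys_cluster = {x for x in mentions if sys_id(x) == sys_id(m)}
        gold_cluster = {x for x in mentions if gold[x] == gold[m]}
        overlap = len(sys_cluster & gold_cluster)
        p_sum += overlap / len(sys_cluster)
        r_sum += overlap / len(gold_cluster)

    precision = p_sum / len(mentions)
    recall = r_sum / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: mention "d" was never annotated, so it does not affect the score
gold = {"a": "G1", "b": "G1", "c": "G2"}
system = {"a": "S1", "b": "S1", "c": "S1", "d": "S2"}
precision, recall, f1 = b_cubed_on_annotated(system, gold)
```

A known limitation of this restriction is that it cannot penalize a system for wrongly merging an annotated mention with unannotated ones, which is part of why the question on the slide remains open.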
Other Issues


Evidence & Confidence
Annotation to provide evidence on NIL
System confidence/justification
Correct annotation errors



Need community effort to report errors / share corrections
Improve/extend annotation guidelines, check IAA
Shift some annotation cost from annotating new data to
knowledge resource construction?

Current research bottlenecks in coreference and slot filling lie in knowledge acquisition rather than in a lack of labeled data

e.g., semantic distance between any two nominals for coreference

e.g., large-scale clean paraphrases for slot filling