Information Extraction in the Past 20 Years:
Traditional vs. Open
Heng Ji
jih@rpi.edu
Acknowledgement: some slides from Radu Florian and Stephen Soderland
• Long successful run
– MUC
– CoNLL
– ACE
– TAC-KBP
– DEFT
– BioNLP
• Programs
– MUC
– ACE
– GALE
– MRP
– BOLT
– DEFT
 Genres
– Newswire
– Broadcast news
– Broadcast conversations
– Weblogs
– Blogs
– Newsgroups
– Speech
– Biomedical data
– Electronic Medical Records
(Figure: Quality vs. Portability)

Quality Challenges
Where have we been?
• We're thriving
• We're making slow but consistent progress: Relation Extraction, Event Extraction, Slot Filling
• We're running around in circles: Entity Linking, Name Tagging
• We're stuck in a tunnel: Entity Coreference Resolution
Name Tagging: “Old” Milestones

Year | Tasks & Resources | Methods | F-Measure | Example References
-----|-------------------|---------|-----------|-------------------
1966 | – | First person name tagger, with punch cards; 30+ decision-tree-type rules | – | (Borkowski et al., 1966)
1998 | MUC-6 | MaxEnt with diverse levels of linguistic features | 97.12% | (Borthwick and Grishman, 1998)
2003 | CoNLL | System combination; sequential labeling with Conditional Random Fields | 89% | (Florian et al., 2003; McCallum et al., 2003; Finkel et al., 2005)
2006 | ACE | Diverse levels of linguistic features, re-ranking, joint inference | ~89% | (Florian et al., 2006; Ji and Grishman, 2006)
• Our progress compared to 1966: more data, a few more features, and fancier learning algorithms
• Not much active work after ACE because we tend to believe it's a solved problem…
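The "sequential labeling" entry above can be made concrete with a tiny BIO name tagger: a hand-rolled, log-space Viterbi decoder. The toy emission/transition scores and the two-word gazetteer below are illustrative stand-ins for learned CRF feature weights, not a trained model.

```python
# Minimal sketch of sequential BIO labeling in the spirit of CRF tagging:
# a hand-rolled log-space Viterbi decoder over hypothetical scores.
import math

TAGS = ["O", "B-PER", "I-PER"]
NAMES = {"Donna", "Karan"}      # hypothetical gazetteer

def emit(tok, tag):
    looks_name = tok[:1].isupper() and tok in NAMES
    if tag == "O":
        return -2.0 if looks_name else 0.0
    return 1.0 if looks_name else -3.0

def trans(prev, cur):
    if cur == "I-PER" and prev == "O":
        return -math.inf                 # illegal: I-PER cannot start a mention
    return 0.1 if (prev, cur) == ("B-PER", "I-PER") else 0.0

def viterbi(tokens, emit, trans):
    """Highest-scoring tag sequence under emission + transition scores."""
    # I-PER may not open the sentence either.
    best = [{t: ((emit(tokens[0], t) if t != "I-PER" else -math.inf), None)
             for t in TAGS}]
    for i in range(1, len(tokens)):
        col = {}
        for t in TAGS:
            prev, sc = max(((p, best[i - 1][p][0] + trans(p, t)) for p in TAGS),
                           key=lambda x: x[1])
            col[t] = (sc + emit(tokens[i], t), prev)
        best.append(col)
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi("The designer Donna Karan sold her company .".split(), emit, trans))
```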
The end of extreme happiness is sadness…

(Figures: state-of-the-art reported in papers vs. experiments on ACE 2005 data)
Challenges
• Defining or choosing an IE schema
• Dealing with genres & variations
– Dealing with novelty
• Bootstrapping a new language
• Improving the state-of-the-art with unlabeled data
• Dealing with a new domain
• Robustness
99 Schemas of IE on the Wall…
• Many IE schemas over the years:
– MUC: 7 types
• PER, ORG, LOC, DATE, TIME, MONEY, PERCENT
– ACE: 5, later 7 types
• PER, ORG, GPE, LOC, FAC, WEA, VEH
• Has substructure (subtypes, mention types, specificity, roles)
– CoNLL: 4 types
• ORG, PER, LOC, MISC
– OntoNotes: 18 types
• CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
– IBM KLUE2: 50 types, including event anchors
– Freebase categories
– Wikipedia categories
• Challenges:
– Selecting an appropriate schema to model
– Combining training data
My Favorite Booby-trap Document
http://www.nytimes.com/2000/12/19/business/lvmh-makes-a-two-part-offer-for-donna-karan.html
LVMH Makes a Two-Part Offer for Donna Karan
By LESLIE KAUFMAN
Published: December 19, 2000
The fashion house of Donna Karan, which has long struggled to achieve financial equilibrium, has finally
found a potential buyer. The giant luxury conglomerate LVMH-Moet Hennessy Louis Vuitton, which has
been on a sustained acquisition bid, has offered to acquire Donna Karan International for $195 million in a
cash deal with the idea that it could expand the company's revenues and beef up accessories and
overseas sales.
At $8.50 a share, the LVMH offer represents a premium of nearly 75 percent to the closing stock price on
Friday. Still, it is significantly less than the $24 a share at which the company went public in 1996. The
final price is also less than one-third of the company's annual revenue of $662 million, a significantly
smaller multiple than European luxury fashion houses like Fendi were receiving last year.
The deal is still subject to board approval, but in a related move that will surely help pave the way, LVMH
purchased Gabrielle Studio, the company held by the designer and her husband, Stephan Weiss, that
holds all of the Donna Karan trademarks, for $450 million. That price would be reduced by as much as
$50 million if LVMH enters into an agreement to acquire Donna Karan International within one year. In a
press release, LVMH said it aimed to combine Gabrielle and Donna Karan International and that it
expected that Ms. Karan and her husband ''will exchange a significant portion of their DKI shares for, and
purchase additional stock in, the combined entity.''
Analysis of an Error
Donna Karan International
Analysis of an Error: How can you Tell?

(Figure: names containing "International" seen in training data, with entity type and frequency)
FAC: Saddam Hussein International Airport (8), Saddam International Airport (7), Baghdad International (1), Baghdad International Airport, International Space Station (1), International Press Club (1)
ORG: Amnesty International (3), International Criminal Court (1), Habitat for Humanity International (1), U-Haul International (1), International Committee of the Red Cross (4), International Committee for the Red Cross (1), International Committee of Red Cross (1), American International Group Inc. (1), Boots and Coots International Well Control Inc. (1), International Black Coalition for Peace and Justice (1), Center for Strategic and International Studies (2), International Monetary Fund (1)
Ambiguous mentions: Donna Karan International, Dana International, Ronald Reagan International, Saddam Hussein International
Dealing With Different Genres
• Weblogs:
– All-lowercase data
• obama has stepped up what bush did even to the point of helping our enemy in Libya.
– Non-standard capitalization / title case
• LiveLeak.com - Hillary Clinton: Saddam Has WMD, Terrorist Ties (Video)
Solution: Case Restoration (truecasing)
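A minimal frequency-based truecaser can be sketched as follows; the two-sentence cased corpus below is an illustrative stand-in for counts gathered from a large body of well-edited text.

```python
# Minimal sketch of truecasing (case restoration): restore each token to its
# most frequent cased form observed in well-edited text.
from collections import Counter

def build_lexicon(corpus_sentences):
    """Count surface forms of each lowercased word in cased text."""
    forms = {}
    for sent in corpus_sentences:
        for i, tok in enumerate(sent.split()):
            if i == 0:                 # sentence-initial caps are uninformative
                continue
            forms.setdefault(tok.lower(), Counter())[tok] += 1
    return forms

def truecase(sentence, forms):
    """Restore each token to its most frequent cased form."""
    out = []
    for tok in sentence.split():
        c = forms.get(tok.lower())
        out.append(c.most_common(1)[0][0] if c else tok)
    return " ".join(out)

corpus = ["Critics said Obama has stepped up what Bush did .",
          "Aid reached Libya after Bush spoke ."]
lex = build_lexicon(corpus)
print(truecase("obama has stepped up what bush did in libya .", lex))
```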
Out-of-domain data
Volunteers have also aided victims of numerous other disasters, including hurricanes
Katrina, Rita, Andrew and Isabel, the Oklahoma City bombing, and the September 11
terrorist attacks.
Out-of-domain Data
• Manchester United manager Sir Alex Ferguson got a boost on Tuesday as a horse he part-owns, What A Friend, landed the prestigious Lexus Chase here at Leopardstown racecourse.
Bootstrapping a New Language
• English is resource-rich:
– Lexical resources: gazetteers
– Syntactic resources: Penn TreeBank
– Semantic resources: WordNet, entity-labeled data (MUC, ACE, CoNLL), FrameNet, PropBank, NomBank, OntoBank
• How can we leverage these resources in other languages?
• MT to the rescue!
Mention Detection Transfer
• ES: El soldado nepalés fue baleado por ex soldados haitianos cuando patrullaba la zona central de Haiti, informó Minustah.
• EN: The Nepalese soldier was gunned down by former Haitian soldiers when patrullaba the central area of Haiti, reported minustah.
(Alignment figure: English tokens with BIO mention labels, aligned token-by-token to the Spanish source so the labels can be projected back:)
The/O Nepalese/B-GPE soldier/B-PER was/O gunned/O down/O by/O former/O Haitian/B-GPE soldiers/B-PER when/O patrolling/O the/O central/O area/B-LOC of/O Haiti/B-GPE ,/O reported/O minustah/O ./O
Language | System | F-measure
---------|--------|----------
Spanish | Direct Transfer | 66.5
Spanish | Source Only (100k words) | 71.0
Spanish | Source Only (160k words) | 76.0
Spanish | Source + Transfer | 78.5
Arabic | Direct Transfer | 51.6
Arabic | Source Only (186k tokens) | 79.6
Arabic | Source + Transfer | 80.5
Chinese | Direct Transfer | 58.5
Chinese | Source Only | 74.5
Chinese | Source + Transfer | 76.0
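The transfer step behind these numbers can be sketched as label projection across word alignments. The one-to-one alignment dictionary below is a stand-in for the output of a real word aligner inside an MT pipeline.

```python
# Minimal sketch of cross-lingual mention-label projection: BIO labels on the
# English side are copied to the aligned source-language tokens.
def project_labels(src_tokens, tgt_labels, align):
    """align maps target (English) token index -> source token index."""
    src_bio = ["O"] * len(src_tokens)
    for t_idx, s_idx in align.items():
        if tgt_labels[t_idx] != "O":
            src_bio[s_idx] = tgt_labels[t_idx]
    return src_bio

es = ["El", "soldado", "nepalés"]
en = ["The", "Nepalese", "soldier"]      # shown for readability
en_bio = ["O", "B-GPE", "B-PER"]
alignment = {0: 0, 1: 2, 2: 1}           # Nepalese<->nepalés, soldier<->soldado
print(list(zip(es, project_labels(es, en_bio, alignment))))
```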
How to deal with out-of-domain data? How to even detect if you're out of domain?
How to deal with unseen words of the day (WotD)? (e.g., ISIS, ISIL, IS, Ebola)
How to significantly improve the state-of-the-art using unlabeled data?
What’s Wrong?
• Name taggers are getting old (trained on 2003 news & tested on 2012 news)
• Genre adaptation (informal contexts, posters)
• Revisit the definition of name mention – extraction for linking
• Limited types of entities (we really only cared about PER, ORG, GPE)
• Old unsolved problems
– Identification: “Asian Pulp and Paper Joint Stock Company , Lt. of Singapore”
– Classification: “FAW has also utilized the capital market to directly finance,…” (FAW = First Automotive Works)

Potential Solutions for Quality
• Word clustering, lexical knowledge discovery (Brown, 1992; Ratinov and Roth, 2009; Ji and Lin, 2010)
• Feedback from linking, relation, event (Sil and Yates, 2013; Li and Ji, 2014)

Potential Solutions for Portability
• Extend entity types based on AMR (140+)
Entity Linking Milestones
• 2006: The first definition of the Wikification task (Bunescu and Pasca, 2006)
• 2009: TAC-KBP Entity Linking launched (McNamee and Dang, 2009)
• 2008-2012: Supervised learning-to-rank with diverse levels of features such as entity profiling and various popularity and similarity measures (Gao et al., 2010; Chen and Ji, 2011; Ratinov et al., 2011; Zheng et al., 2010; Dredze et al., 2010; Anastacio et al., 2011)
• 2008-2013: Collective inference and coherence measures (Milne and Witten, 2008; Kulkarni et al., 2009; Ratinov et al., 2011; Chen and Ji, 2011; Ceccarelli et al., 2013; Cheng and Roth, 2013)
• 2012: Various applications, e.g., coreference resolution (Ratinov & Roth, 2012) – Dan’s talk
• 2014: TAC-KBP Entity Discovery and Linking (end-to-end name tagging, cross-document entity clustering, entity linking) (Ji et al., 2014)
• Many international evaluations were inspired by TAC-KBP; more than 130 papers have been published
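The learning-to-rank line of work can be sketched as scoring each KB candidate by a popularity prior plus context overlap with the candidate's profile. The two-entry KB, the profiles, and the additive scorer below are illustrative assumptions; real systems learn feature weights.

```python
# Minimal sketch of entity linking as candidate ranking:
# score = popularity prior + bag-of-words overlap with the entity profile.
def score(candidate, context, kb):
    profile, popularity = kb[candidate]
    return popularity + len(set(context) & profile)

def link(mention, context, kb, candidates):
    return max(candidates[mention], key=lambda c: score(c, context, kb))

kb = {  # hypothetical mini-KB: profile terms and a popularity prior
    "Donna Karan International": ({"fashion", "company", "lvmh"}, 0.2),
    "Dana International":        ({"singer", "eurovision"}, 0.5),
}
cands = {"International": ["Donna Karan International", "Dana International"]}
ctx = "lvmh offered to acquire the fashion company".split()
print(link("International", ctx, kb, cands))
```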
Current Linking Problems and Possible Solutions
• State-of-the-art Entity Linking: 85% B-cubed+ F-score on formal genres and 70% B-cubed+ F-score on informal genres
• State-of-the-art Entity Discovery and Linking: 66% Discovery and Linking F-score, 73% Clustering CEAFm F-score
• Remaining challenges
– Popularity bias
– Require better meaning representation
– Select collaborators from rich contexts
– Knowledge gap between source and KB
– Cross-lingual Entity Linking (name translation problem)
• Potential solutions:
– Deep knowledge acquisition and representation (e.g., AMR)
– Better graph search alignment algorithms
– Make more people excited about Chinese and Spanish
Slot Filling Milestones
• 2009-2014: Top systems achieved 30%-40% F-measure
– Ground truth is created by manual assessment of pooled system output (relative recall); scores may appear lower when the competing teams are stronger
– 2014 queries are more challenging than 2013's, including some ambiguous queries shared with entity linking (Stephen’s talk)
– Consistent progress for an individual system (RPI, tested on 2014 data): 2010: 20% → 2011: 22% → 2013: 28% → 2014: 34%
• Successful methods
– Multi-label multi-instance learning (Surdeanu et al., 2012)
– Combination of distant supervision with heuristic rules and patterns (Roth et al., 2013)
– Cross-source cross-system inference (Chen et al., 2011; Yu et al., 2014)
– Linguistic constraints (Yu et al., 2014) – Heng’s one-week pencil-and-paper work to semi-automatically acquire trigger phrases; an awfully simple trigger scoping method beat all 2013 systems
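The trigger scoping idea can be sketched as follows: accept a candidate slot fill only if a slot-specific trigger phrase occurs within a small window of the query entity. The trigger list, the window size, and the naive substring matching are all illustrative assumptions, not the actual system.

```python
# Minimal sketch of trigger scoping for slot filling: a candidate fill is
# accepted only when it co-occurs with a slot trigger near the query entity.
TRIGGERS = {  # hypothetical trigger phrases per slot
    "per:spouse": ["wife", "husband", "married", "wedding"],
}

def scoped_fill(sentence, query, candidate, slot, window=10):
    toks = sentence.lower().split()
    if query.lower() not in toks:
        return False
    q = toks.index(query.lower())
    span = " ".join(toks[max(0, q - window): q + window + 1])
    # Accept only when the candidate and a trigger both fall inside the scope.
    return candidate.lower() in span and any(t in span for t in TRIGGERS[slot])

s = "Roberts , 39 , and husband Danny Moder , 38 , are parents"
print(scoped_fill(s, "Roberts", "Danny", "per:spouse"))
print(scoped_fill("Roberts met Danny at a party", "Roberts", "Danny", "per:spouse"))
```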
Have the Error Sources Changed over the Years?

(Figure: error-source distribution in 2010 (Min and Grishman, 2011) vs. 2014 (Yu and Ji, 2014))
Blame Ourselves First…
• Non-verb and multi-word expressions as triggers
– his men back to their compound
• Knowledge scarcity – long tail
– A suicide bomber detonated explosives at the entrance to a crowded
– medical teams carting away dozens of wounded victims
– Today I was let go from my job after working there for 4 1/2 years.
– Possible solution: increase coverage with FrameNet (Li et al., 2014)
• Global context
– I didn't want to hurt him. I miss him to death.
– I threw stone out of the window. vs. I threw him out of the window.
– Ellison to spend $10.3 billion to get his company.
– We believe that the likelihood of them using those weapons goes up.
– Fifteen people were killed and more than 30 wounded Wednesday as a suicide bomber blew himself up on a student bus in the northern town of Haifa
– Possible solution: joint modeling of triggers and arguments (Li et al., 2013)
Then Blame Others…
• Fundamental language problem – ambiguity and variety
• Coreference, coreference, coreference…
– 25% of the examples involve coreference beyond current system capabilities, such as nominal anaphors and non-identity coreference
– Almost overnight, he became fabulously rich, with a $3-million book deal, a $100,000 speech-making fee, and a lucrative multifaceted consulting business, Giuliani Partners. … His consulting partners included seven of those who were with him on 9/11, and in 2002 Alan Placa, his boyhood pal, went to work at the firm.
– After a successful karting career in Europe, Perera became part of the Toyota F1 Young Drivers Development Program and was a Formula One test driver for the Japanese company in 2006.
– “a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”
Then Blame Others…
• Paraphrase, paraphrase, paraphrase…
– “employee/member”:
• Sutil, a trained pianist, tested for Midland in 2006 and raced for Spyker in 2007, where he scored one point in the Japanese Grand Prix.
• Daimler Chrysler reports 2004 profits of $3.3 billion; Chrysler earns $1.9 billion.
• In her second term, she received a seat on the powerful Ways and Means Committee.
• Jennifer Dunn was the face of the Washington state Republican Party for more than two decades.
• Buchwald lied about his age and escaped into the Marine Corps.
• By 1942, Peterson was performing with one of Canada's leading big bands, the Johnny Holmes Orchestra.
– “spouse”:
• Buchwald's 1952 wedding -- Lena Horne arranged for it to be held in London's Westminster Cathedral -- was attended by Gene Kelly, John Huston, Jose Ferrer, Perle Mesta and Rosemary Clooney, to name a few.
Then Blame Others…
• Inference, inference, inference…
– Systems would benefit from specialists able to reason about times, locations, family relationships, and employment relationships.
– People Magazine has confirmed that actress Julia Roberts has given birth to her third child, a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8 lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus, who were born in November…
– He [Pascal Yoadimnadji] has been evacuated to France on Wednesday after falling ill and slipping into a coma in Chad, Ambassador Moukhtar Wawa Dahab told The Associated Press. His wife, who accompanied Yoadimnadji to Paris, will repatriate his body to Chad, the amba. → Is he dead? In Paris?
– Until last week, Palin was relatively unknown outside Alaska… → Does she live in Alaska?
– The list says that the state is owed $2,665,305 in personal income taxes by singer Dionne Warwick of South Orange, N.J., with the tax lien dating back to 1997. → Does she live in N.J.?
Portability/Scalability Challenges
Defining the Problem
• Deep understanding of all possible relations?
• Open IE, pre-emptive IE, on-demand IE…
10/15/2014
DEFT PI meeting -- U. Washington
Defining the Problem
• Deep understanding of all possible relations?
• Deep Extraction for Focused Tasks (D.E.F.T.)
– User has a focused information need:
• A few dozen relations, several entity types:
• Date_of_birth(per, date), city_of_headquarters(org, city), … (TAC-KBP)
• Treatment(substance, condition), studies_disease(per/org, condition), …
• Arrive_in(per, loc), meet_with(per, per), unveil(org, product), …
– Quickly train an extractor for the task
• Domain independent: parsing, Open IE, SRL, …
• Task specific: semantic tagging, extraction patterns, …
Freedman et al. Extreme Extraction -- Machine Reading in a Week. EMNLP 2011
Zhang et al. NewsSpike Event Extractor, in review
Aim for the Head?

(Figure: a Zipfian distribution of the surface forms that express a textual relation, with frequency on the y-axis and patterns on the x-axis. The high-frequency head is "dead simple", the middle of the curve is "the real challenge", and the long tail is "a hopeless case".)
Open IE for KBP
• Advantages of Open IE
– Robust
– Massively scalable
– Works out of the box
– Finds whatever relations are expressed in the text
– Not tied to an ontology of relations
• Disadvantages
– Finds whatever relations are expressed in the text
– Not tied to an ontology of relations
• Challenge
– Map Open IE to an ontology of relations
– Minimum of user effort
github/knowitall/openie
OpenIE–KBP Rule Language

Open IE triple:  (Smith, was appointed, Acting Director of Acme Corporation)
                  Arg1   Rel            Arg2
Mapped slot:     per:employee_or_member_of (Smith, Acme Corporation)
                 (entity = Smith, slotfill = Acme Corporation)

Terms in Rule    | Example
-----------------|--------------------------
Target relation: | per:employee_or_member_of
Query entity in: | Arg1
Slotfill in:     | Arg2
Slotfill type:   | Organization
Arg1 terms:      | –
Relation terms:  | appointed
Arg2 terms:      | <JobTitle> of
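Applying such a rule can be sketched as follows. The rule encoding and matching logic are assumptions (a regex stands in for the `<JobTitle> of` term), not the actual rule interpreter shipped with github/knowitall/openie.

```python
# Minimal sketch of mapping an Open IE triple (arg1, rel, arg2) to a KBP
# slot via a hand-written rule like the one in the table above.
import re

RULE = {
    "target": "per:employee_or_member_of",
    "query_in": "arg1",                  # query entity comes from Arg1
    "relation_terms": ["appointed"],     # must appear in the Rel phrase
    "arg2_pattern": r"(Director|President|CEO) of (.+)",  # "<JobTitle> of"
}

def apply_rule(triple, rule):
    arg1, rel, arg2 = triple
    if not any(t in rel for t in rule["relation_terms"]):
        return None
    m = re.search(rule["arg2_pattern"], arg2)
    if not m:
        return None
    query = arg1 if rule["query_in"] == "arg1" else arg2
    return (rule["target"], query, m.group(2))  # slot fill is the Organization

print(apply_rule(("Smith", "was appointed", "Acting Director of Acme Corporation"), RULE))
```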
Hits the Head, but …
• High precision, average recall
• Limited recall from Open IE
– Good with verb-based relations
– Weak on noun-based relations
• “Implicit relation” patterns
– “Bashardost, 43, is …” → (Bashardost, [has age], 43)
– “… the Election Complaints Commission (ECC) …” → (Election Complaints Commission, [has acronym], ECC)
– “French journalist Jean LeGall reported that …” → (Jean LeGall, [has job title], journalist), (Jean LeGall, [has nationality], French)
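Such implicit-relation patterns can be sketched as plain regexes over the raw text; the two patterns below are illustrative and far narrower than a real pattern inventory.

```python
# Minimal sketch of "implicit relation" patterns: regexes recover relations
# that Open IE misses because no verb expresses them.
import re

PATTERNS = [
    # "Bashardost, 43, ..."  -> age
    (r"\b([A-Z][a-z]+), (\d{1,3}),", "[has age]"),
    # "... Election Complaints Commission (ECC) ..." -> acronym
    (r"\b((?:[A-Z][a-z]+ )+Commission) \(([A-Z]{2,})\)", "[has acronym]"),
]

def implicit_relations(text):
    out = []
    for pat, rel in PATTERNS:
        for m in re.finditer(pat, text):
            out.append((m.group(1), rel, m.group(2)))
    return out

print(implicit_relations("Bashardost, 43, criticized the "
                         "Election Complaints Commission (ECC) today."))
```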
NewsSpike Event Extractor
• Extracts event relations from news streams
– Event = event_phrase(arg1_type, arg2_type)
• NewsSpike = (entity1, entity2, date, {sentences})
– from parallel news streams
– Open IE identifies entity1, entity2, and event phrase
– a spike in frequency on that date indicates
an event between entity1 and entity2
• Automatically discover relations not covered by Freebase
arrive_in (person, location)
beat (sports_team, sports_team)
meet_with (person, person)
nominate (person/politician, person)
unveil (organization, product)
…
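The spike signal itself can be sketched as a simple ratio test on daily counts of an entity pair; the threshold and the toy mention data are illustrative assumptions.

```python
# Minimal sketch of NewsSpike's core signal: a frequency spike of an
# (entity1, entity2) pair on one date, across parallel news streams,
# suggests an event between the two entities on that date.
from collections import Counter

def find_spikes(mentions, ratio=3.0):
    """mentions: (entity1, entity2, date) tuples. A pair 'spikes' on a date
    when its count that day exceeds ratio * its average daily count."""
    by_day = Counter(mentions)
    totals = Counter((e1, e2) for e1, e2, _ in mentions)
    n_days = len({d for _, _, d in mentions})
    return [(e1, e2, d) for (e1, e2, d), c in by_day.items()
            if c > ratio * totals[(e1, e2)] / n_days]

mentions = ([("Obama", "Afghanistan", "5/17")] * 6
            + [("Obama", "Afghanistan", d)
               for d in ("5/12", "5/13", "5/14", "5/15", "5/16")])
print(find_spikes(mentions))
```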
NewsSpike Architecture

(Diagram, training phase: parallel news streams → discover events → NewsSpikes NS=(a1,a2,d,S) with parallel sentences → group by event type E=e(t1,t2) → generate training data → learn an event extractor. Testing phase: test sentences s → event extractor → extractions s → E(a1,a2).)
High Quality Training
• Paraphrases in a NewsSpike give positive training
• Negative training from the temporal negation heuristic:
– If event phrases e1 and e2 are in the same NewsSpike
– and one of them is negated,
– then e1 is probably not a paraphrase of e2
• “Team1 faces Team2” vs. “Team1 did not beat Team2” → face ≠ beat
• High precision from negative training
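The heuristic can be sketched directly; the negation word list is an illustrative assumption.

```python
# Minimal sketch of the temporal negation heuristic: event phrases that
# co-occur in the same NewsSpike are paraphrase candidates, unless exactly
# one of them is negated, which yields a negative (non-paraphrase) pair.
NEGATIONS = {"not", "n't", "never"}

def pair_label(phrase1, phrase2):
    """Label a candidate event-phrase pair from one NewsSpike."""
    neg1 = any(w in NEGATIONS for w in phrase1.split())
    neg2 = any(w in NEGATIONS for w in phrase2.split())
    if neg1 != neg2:                 # exactly one is negated
        return "negative"            # e.g. face != beat
    return "candidate-paraphrase"

print(pair_label("faces", "did not beat"))
```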
High Precision Event Extractor
• Doubles the area under the PR curve vs. Universal Schemas

(Figure: precision-recall curves for NewsSpike-E2 on a news stream, Universal Schemas on a news stream, and Universal Schemas on NYT)