KBA 2012 Overview slides - TREC Knowledge Base Acceleration

KBA
a TREC evaluation
Entity-oriented filtering
of large streams
John R. Frank
jrf@mit.edu
Ian Soboroff
ian.soboroff@nist.gov
Max Kleiman-Weiner
maxkw@mit.edu
Dan A. Roberts
drob@mit.edu
http://trec-kba.org
Date: Tue, 13 Mar 2012 02:45:40 +0000
From: Google Alerts <googlealerts-noreply@google.com>
Subject: Google Alert - "John R. Frank"
=== Web - 2 new results for ["John R. Frank"] ===
John R. Frank
SPOKANE, Wash. - John R. Frank, 55, died March 4, 2012, in Coeur d' Alene,
Idaho. Survivors include: his wife, Miki; daughter, Patricia Frank; ...
<http://www.hutchnews.com/obituaries/Frank--John-CP>
In Memory of John R Frank
Biography. John R. Frank, age 55, passed away at Sacred Heart Medical
Center in Spokane, WA, on March 4, 2012. John was born in Hutchison, KS, ...
<http://www.englishfuneralchapel.com/sitemaker/sites/Englis1/obit.cgi?user=583335Frank>
Entities in Wikipedia or
another Knowledge Base
2012 Task:
Filtering to Recommend Citations
1) Initialize with a target WP entity
• state of WP from Jan 2012
2) Iterate over stream of text items
• Oct-Dec 2011: train on labels
• Jan-Apr 2012: labels hidden
3) For each, output a confidence between 0 and 1
Your KBA System: automatically recommend new edits
Sponsors: Diffeo
Content Stream
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening
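The three task steps can be sketched as a simple loop. This is a minimal illustration, not the reference KBA system: `StreamItem`, `score`, and `run_filter` are hypothetical names, and the token-overlap score is only a toy stand-in for a real confidence model.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class StreamItem:
    """Hypothetical stand-in for one text item in the stream."""
    doc_id: str
    text: str


def score(item: StreamItem, entity_name: str) -> float:
    """Toy confidence in [0, 1]: fraction of the entity's name tokens found in the text."""
    tokens = entity_name.lower().split()
    text = item.text.lower()
    hits = sum(1 for tok in tokens if tok in text)
    return hits / len(tokens) if tokens else 0.0


def run_filter(stream: Iterable[StreamItem], entity_name: str) -> Iterator[Tuple[str, float]]:
    """Steps 2 and 3: iterate over the stream and emit (doc_id, confidence) per item."""
    for item in stream:
        yield item.doc_id, score(item, entity_name)


# Step 1: initialize with a target entity (here just its name, as an assumption).
stream = [
    StreamItem("doc-1", "Masaru Emoto is a Japanese photographer and scientist."),
    StreamItem("doc-2", "Water covers 70% of our planet."),
]
results = list(run_filter(stream, "Masaru Emoto"))
```

A real system would replace `score` with a model trained on the Oct-Dec 2011 labels and would read items from the hourly chunks rather than from an in-memory list.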
s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/
Accelerate?
rate of assimilation << stream size
# editors << # entities << # mentions
(definition of a “large” KB)
How many days must a news article wait before
being cited in Wikipedia?
Complex entity with many
relationships and attributes.
Has many interests, including trying to
take over UK soccer teams.
His empire includes many entities…
Note: Usmanov not mentioned in this text!
Elaborate link trails…
Citation #18
Example KBA Rating Task
Published: March 31, 2012
Impact of Thoughts on Water
By Denis Gorce-Bourge
Water covers 70% of our Blue planet and our body is made
of about 70% water.
Masaru Emoto is a Japanese Photographer and scientist.
He is known over the world for his remarkable work on
water and its deep connection with individual and
collective consciousness.
For decades, Masaru took pictures of frozen crystals of
water and tested the direct influence of the environment
on the quality of those crystals.
Interannotator Agreement
97.6% +/- 1.4% (N=5365) coref
69.5% +/- 2.7% (N=1352) central
70.9% +/- 2.0% (N=2403) relevant
58.4% +/- 3.4% (N=884) neutral
84.9% +/- 2.0% (N=2599) garbage
82.6% +/- 1.8% (N=3200) central + relevant
89.0% +/- 1.7% (N=3551) central + relevant + neutral
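The last two rows merge rating classes, and agreement rises when neighboring classes are merged. The sketch below illustrates that effect on made-up labels. It assumes a simple binary-membership definition of agreement, which may differ from how the slide's figures and error bars were computed; `agreement`, `ann_a`, and `ann_b` are hypothetical.

```python
from typing import List, Set


def agreement(a: List[str], b: List[str], positive: Set[str]) -> float:
    """Fraction of items where both annotators agree on membership in `positive`."""
    if len(a) != len(b) or not a:
        raise ValueError("annotator label lists must be non-empty and equal in length")
    same = sum((x in positive) == (y in positive) for x, y in zip(a, b))
    return same / len(a)


# Made-up ratings from two annotators on the same six documents.
ann_a = ["central", "relevant", "neutral", "garbage", "central", "relevant"]
ann_b = ["relevant", "relevant", "garbage", "garbage", "central", "central"]

central_only = agreement(ann_a, ann_b, {"central"})
central_relevant = agreement(ann_a, ann_b, {"central", "relevant"})
```

On this toy data the annotators disagree twice about "central" alone, but never once "central" and "relevant" are treated as a single class.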
TRECing the continental divide between NLP and IR
NLP:
• Data parsing centric
• Universal annotation
• Scores → probabilities
• Reductionist
IR:
• User task centric
• Variation in interpretation
• Scores → cascading lists
• Constructionist, emergence
string matching
task generator
91% recall
15% precision
26% F1
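The F1 figure follows from the stated recall and precision, since F1 is their harmonic mean. A short check, with `f1_score` as an illustrative helper name:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.15, recall=0.91)
print(f"{f1:.0%}")  # prints 26%
```

Because the harmonic mean is dominated by the smaller value, the high recall does little to offset the low precision.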
[Figure: number of "central" rated documents per hour per entity. y-axis: log(count); x-axis: 0 to 5000. Entities shown:]
Aharon_Barak, Annie_Laurie_Gaylor, Bill_Coen, Charlie_Savage, Frederick_M._Lawrence, Jim_Steyer, Mario_Garnero, Rodrigo_Pimentel, Satoshi_Ishii, William_D._Cohan
Alex_Kapranos, Basic_Element_(company), Boris_Berezovsky_(businessman), Darren_Rowse, Ikuhisa_Minowa, Lisa_Bloom, Masaru_Emoto, Roustam_Tariko, Vladimir_Potanin, William_H._Gates,_Sr
Alexander_McCall_Smith, Basic_Element_(music_group), Boris_Berezovsky_(pianist), Douglas_Carswell, James_McCartney, Lovebug_Starski, Nassim_Nicholas_Taleb, Ruth_Rendell, William_Cohen
KBA 2013
More entity types with an emphasis on temporality in the stream.
Target Entities: People and Organizations
• KB: Wikipedia or maybe Freebase
• Centrally Relevant: Citation worthy
• Training Data: Judgments from early stream
• Annotation: High recall on all mentioning docs.
Target Entities: Pharmaceutical Compounds
• KB: Merck KB?
• Centrally Relevant: Reporting of Adverse Drug Reaction (ADR) (an event)
• Training Data: (same)
• Annotation: Focus recall on first person reporting & negative reactions?
Target Entities: Event-type Entities, defined by a cluster of entities and possibly a Type-of-Event from a taxonomy
• KB: WP/FB for cluster of entities, possibly also event itself.
• Centrally Relevant: Provides causality info
• Training Data: Judgments on docs for that Type-of-Event but different specific event. Find training data from TDT? Use citations in Category:Current_events?
• Annotation: Judge post-hoc?
Pool top-K filtered docs, or use each KBA run as separate KBP input. (1000x filter)
KBA
Clusters of related entities and/or event-type entities
KBX
KBP
Output KB
Cold Start queries focused on: nil entities related to target cluster and/or causality of event
Must coordinate choice of KBA target entities with desired content of KBs for Cold Start queries.
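One reading of "pool top-K filtered docs" is to take the K highest-confidence documents from each KBA run and pass their union downstream. The sketch below assumes that reading; `pool_top_k`, the run names, and the scores are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A run is a list of (doc_id, confidence) pairs, as assumed for this sketch.
Run = List[Tuple[str, float]]


def pool_top_k(runs: Dict[str, Run], k: int) -> Set[str]:
    """Union of the k highest-confidence doc ids from each run."""
    pooled: Set[str] = set()
    for scored_docs in runs.values():
        ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        pooled.update(doc_id for doc_id, _ in ranked[:k])
    return pooled


runs = {
    "run_a": [("d1", 0.9), ("d2", 0.4), ("d3", 0.7)],
    "run_b": [("d4", 0.8), ("d1", 0.6), ("d2", 0.1)],
}
pool = pool_top_k(runs, k=2)
```

The alternative on the slide, using each KBA run as a separate KBP input, would skip the union and hand each run's top-K list downstream on its own.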
KBA Stream Corpus 2012 (or the new Stream Corpus 2013)
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening
Sponsors: Thank You.
Diffeo
Thanks for your time.
John R. Frank
jrf@mit.edu
http://trec-kba.org