KBA: a TREC evaluation
Entity-oriented filtering of large streams

John R. Frank jrf@mit.edu
Ian Soboroff ian.soboroff@nist.gov
Max Kleiman-Weiner maxkw@mit.edu
Dan A. Roberts drob@mit.edu
http://trec-kba.org

Date: Tue, 13 Mar 2012 02:45:40 +0000
From: Google Alerts <googlealerts-noreply@google.com>
Subject: Google Alert - "John R. Frank"

=== Web - 2 new results for ["John R. Frank"] ===

John R. Frank
SPOKANE, Wash. - John R. Frank, 55, died March 4, 2012, in Coeur d'Alene, Idaho. Survivors include: his wife, Miki; daughter, Patricia Frank; ...
<http://www.hutchnews.com/obituaries/Frank--John-CP>

In Memory of John R Frank
Biography. John R. Frank, age 55, passed away at Sacred Heart Medical Center in Spokane, WA, on March 4, 2012. John was born in Hutchison, KS, ...
<http://www.englishfuneralchapel.com/sitemaker/sites/Englis1/obit.cgi?user=583335Frank>

2012 Task: Filtering to Recommend Citations

1) Initialize with a target WP entity
   • state of WP from Jan 2012
2) Iterate over stream of text items
   • Oct-Dec 2011: train on labels
3) For each, output confidence between 0 and 1
   • Jan-Apr 2012: labels hidden

[Diagram labels: Content Stream; Your KBA System; Automatically recommend new edits; Entities in Wikipedia or another Knowledge Base]

Content Stream
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening

Sponsors: Diffeo

s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/

Accelerate?
rate of assimilation << stream size
# editors << # entities << # mentions (definition of a “large” KB)

How many days must a news article wait before being cited in Wikipedia?

Complex entity with many relationships and attributes.

Has many interests, including trying to take over UK soccer teams.
His empire includes many entities…
Note: Usmanov not mentioned in this text!
Elaborate link trails… Citation #18

Example KBA Rating Task

Published: March 31, 2012
Impact of Thoughts on Water
By Denis Gorce-Bourge
Water covers 70% of our Blue planet and our body is made of about 70% water. Masaru Emoto is a Japanese Photographer and scientist. He is known over the world for his remarkable work on water and its deep connection with individual and collective consciousness. For decades, Masaru took pictures of frozen crystals of water and tested the direct influence of the environment on the quality of those crystals.
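
The three task steps (initialize with a target entity, iterate over the stream, output a confidence between 0 and 1 for each item) can be illustrated on a document like the one above. What follows is only a minimal sketch of a naive name-matching filter, not the track's own tooling: `StreamItem`, `name_variants`, `confidence`, and `filter_stream` are hypothetical names introduced here, and the all-or-nothing scoring is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class StreamItem:
    """Hypothetical stand-in for one text item from the content stream."""
    doc_id: str
    text: str


def name_variants(target: str) -> List[str]:
    """Derive a surface form from a Wikipedia-style title such as 'Masaru_Emoto'."""
    full = target.replace("_", " ")
    # Drop a trailing disambiguator such as ' (businessman)'.
    if full.endswith(")") and " (" in full:
        full = full[: full.rindex(" (")]
    return [full]


def confidence(item: StreamItem, variants: List[str]) -> float:
    """Assumed scoring: 1.0 if any variant occurs in the text, else 0.0."""
    text = item.text.lower()
    return 1.0 if any(v.lower() in text for v in variants) else 0.0


def filter_stream(
    target: str, items: Iterable[StreamItem]
) -> Iterator[Tuple[str, float]]:
    """Step 1: initialize with the target; steps 2-3: score each stream item."""
    variants = name_variants(target)
    for item in items:
        yield item.doc_id, confidence(item, variants)
```

Calling `list(filter_stream("Masaru_Emoto", items))` on a list of `StreamItem` objects yields one `(doc_id, confidence)` pair per item. A real system would replace the binary score with a graded one, since a bare name match cannot separate a central document from one that merely mentions the entity.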

Interannotator Agreement

Agreement         N       Rating(s)
97.6% +/- 1.4%    5365    coref
69.5% +/- 2.7%    1352    central
70.9% +/- 2.0%    2403    relevant
58.4% +/- 3.4%    884     neutral
84.9% +/- 2.0%    2599    garbage
82.6% +/- 1.8%    3200    central + relevant
89.0% +/- 1.7%    3551    central + relevant + neutral

TRECing the continental divide between NLP and IR

NLP:
• Data parsing centric
• Universal annotation
• Scores probabilities
• Reductionist

IR:
• User task centric
• Variation in interpretation
• Scores cascading lists
• Constructionist, emergence

String matching task generator: 91% recall, 15% precision, 26% F1

[Figure: number of "central" rated documents per hour per entity; y-axis log(count), x-axis ticks 0 to 5000. Legend: Aharon_Barak, Annie_Laurie_Gaylor, Bill_Coen, Charlie_Savage, Frederick_M._Lawrence, Jim_Steyer, Mario_Garnero, Rodrigo_Pimentel, Satoshi_Ishii, William_D._Cohan, Alex_Kapranos, Basic_Element_(company), Boris_Berezovsky_(businessman), Darren_Rowse, Ikuhisa_Minowa, Lisa_Bloom, Masaru_Emoto, Roustam_Tariko, Vladimir_Potanin, William_H._Gates,_Sr, Alexander_McCall_Smith, Basic_Element_(music_group), Boris_Berezovsky_(pianist), Douglas_Carswell, James_McCartney, Lovebug_Starski, Nassim_Nicholas_Taleb, Ruth_Rendell, William_Cohen]

KBA 2013
More entity types with an emphasis on temporality in the stream.

Target Entities: People and Organizations
   KB: Wikipedia or maybe Freebase
   Centrally Relevant: Citation worthy
   Training Data: Judgments from early stream
   Annotation: High recall on all mentioning docs.

Target Entities: Pharmaceutical Compounds
   KB: Merck KB?
   Centrally Relevant: Reporting of Adverse Drug Reaction (ADR) (an event)
   Training Data: (same)
   Annotation: Focus recall on first person reporting & negative reactions?

Target Entities: Event-type Entities. Defined by a cluster of entities and possibly a Type-of-Event from a taxonomy
   KB: WP/FB for cluster of entities, possibly also event itself.
   Centrally Relevant: Provides causality info
   Training Data: Judgments on docs for that Type-of-Event but different specific event. Find training data from TDT? Use citations in Category:Current_events?
   Annotation: Judge post-hoc?

Pool top-K filtered docs, or use each KBA run as separate KBP input.
(1000x filter)

[Diagram labels: KBA; KBX; KBP; Output KB]
Clusters of related entities and/or event-type entities
Cold Start queries focused on: nil entities related to target cluster and/or causality of event
Must coordinate choice of KBA target entities with desired content of KBs for Cold Start queries.

KBA Stream Corpus 2012 (or the new Stream Corpus 2013)
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening

Thank You.
Sponsors: Diffeo

Thanks for your time.
John R. Frank jrf@mit.edu
http://trec-kba.org