
Social + Mobile + Commerce
Entity Extraction, Linking, Classification, and
Tagging for Social Media:
A Wikipedia Based Approach
Abhishek Gattani, Digvijay Lamba, Nikesh Garera, Mitul Tiwari³, Xiaoyong Chai,
Sanjib Das¹, Sri Subramaniam, Anand Rajaraman², Venky Harinarayan², AnHai Doan¹
@WalmartLabs, ¹University of Wisconsin-Madison, ²Cambrian Ventures, ³LinkedIn
August 27, 2013
The Problem
“Obama gave an immigration speech while on vacation in Hawaii”
Entity Extraction
“Obama” is a Person, “Hawaii” is a location
Entity Linking
“Obama” -> en.wikipedia.org/wiki/Barack_Obama
“Hawaii” -> en.wikipedia.org/wiki/Hawaii
Classification
“Politics”, “Travel”
Tagging
“Politics”, “Travel”, “Immigration”, “President Obama”, “Hawaii”
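To make the target concrete, here is a minimal sketch (in Python) of the kind of annotated record the pipeline should produce for the example sentence; the record layout is hypothetical, not the paper's actual data format.

annotation = {
    "text": "Obama gave an immigration speech while on vacation in Hawaii",
    "entities": [
        {"mention": "Obama",  "type": "Person",
         "link": "en.wikipedia.org/wiki/Barack_Obama"},
        {"mention": "Hawaii", "type": "Location",
         "link": "en.wikipedia.org/wiki/Hawaii"},
    ],
    "classes": ["Politics", "Travel"],             # classification
    "tags": ["Politics", "Travel", "Immigration",  # tagging
             "President Obama", "Hawaii"],
}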
On Social Media Data
Short Sentences
Ungrammatical, misspelled, lots of acronyms
Social Context
From previous conversation/interests: “Go Giants!!”
Large Scale
Tens of thousands of updates per second
Lots of Topics
New topics and themes emerge every day; the space of topics is itself large
Why? – Use cases
• Used extensively at Kosmix and later at @WalmartLabs
– Twitter event monitoring
– In-context ads
– User query parsing
– Product search and recommendations
– Social Mining
• Use Cases
– Central topic detection for a web page or tweet.
– Getting a stream of tweets/messages about a topic.
• Small team at scale
– About 3 engineers at a time
– Processing the entire Twitter firehose
Based on a Knowledge Base
• Global: Covers a wide range of topics. Includes WordNet, Wikipedia, Chrome,
Adam, MusicBrainz, Yahoo Stocks, etc.
• Taxonomy: Converted the Wikipedia graph to a hierarchical taxonomy with IsA
edges that are transitive
• Large: 6.5 million hierarchical concepts with 165 million relationships
• Real Time: Constantly updated from sources, analyst curation, and event detection
• Rich: Synonyms, homonyms, relationships, etc.
Published:
Building, maintaining, and using knowledge bases: A report from the trenches.
In SIGMOD, 2013.
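A minimal sketch of the taxonomy's transitive IsA edges, assuming a simple dict-based parent graph; the nodes and edges below are hypothetical illustrations, not the actual KB.

from collections import deque

# isa[child] -> set of direct parents (IsA edges in the Wikipedia-derived taxonomy)
isa = {
    "Barack Obama": {"Presidents", "Politicians"},
    "Presidents": {"Politicians"},
    "Politicians": {"People"},
    "People": set(),
}

def ancestors(node):
    # Because IsA is transitive, every parent reachable by BFS is an ancestor.
    seen, queue = set(), deque(isa.get(node, ()))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(isa.get(parent, ()))
    return seen

print(ancestors("Barack Obama"))  # {'Presidents', 'Politicians', 'People'}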
Annotate with Contexts
Every social conversation takes place in a context that changes what it means
A Real Time User Context
What topics does this user talk about?
A Real Time Social Context
What topics usually appear in the context of a hashtag, domain, or KB node?
A Web Context
Topics from links in a tweet; what are the topics on a KB node’s Wikipedia page?
Compute the context at scale
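A minimal sketch of how a social context might be computed, assuming a context is just co-occurrence counts of KB topics with a key (hashtag, domain, or user); the function and data below are hypothetical.

from collections import Counter, defaultdict

contexts = defaultdict(Counter)  # key -> topic -> co-occurrence count

def update_context(key, topics):
    # Fold one annotated tweet's topics into the running context for key.
    contexts[key].update(topics)

# e.g., every tweet tagged #Politics contributes its extracted topics
update_context("#Politics", ["Barack Obama", "Russia", "War"])
update_context("#Politics", ["Barack Obama", "Egypt"])

print(contexts["#Politics"].most_common(2))
# [('Barack Obama', 2), ('Russia', 1)]

In the real system this would run as a streaming computation over the firehose, presumably with time decay, to keep the contexts current.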
Example Contexts
Barack Obama
Social: Putin, Russia, White House, SOPA, Syria, Homeownership,
Immigration, Edward Snowden, Al Qaeda
Web: President, White House, Senate, Illinois, Democratic, United
States, US Military, War, Michelle Obama, Lawyer, African
American
www.whitehouse.gov
Social: Petition, Barack Obama, Change, Healthcare, SOPA
#Politics
Social: Barack Obama, Russia, Rick Scott, State Dept, Egypt,
Snowden, War, Washington, House of Representatives
@Whitehouse
User: Barack Obama, Housing Market, Homeownership, Mortgage
Rates, Phoenix, Americans, Middle Class Families
Key Differentiators – Why does it work?
The Knowledge Base
Interleave several problems
Use of Context
Scale
Rule Based
How: First Find Candidate Mentions
“RT Stephen lets watch. Politics of Love is about Obama’s election @EricSu”
Step 1: Pre-Process – clean up tweet
“Stephen lets watch. Politics of Love is about Obama’s election”
Step 2: Find Mentions – All in KB + detectors
[“Stephen”, “lets”, “watch”, “Politics”, “Politics of Love”, “is”, “about”,
“Obama”, “Election”]
Step 3: Initial Rules – Remove obvious bad cases
[“Stephen”, “watch”, “Politics”, “Politics of Love”, “Obama”, “Election”]
Step 4: Initial scoring – Quick and dirty
[“Obama”: 10, “Politics of Love”: 9, “Stephen”: 7, “watch”: 7, “Politics”: 6,
“Election”: 6]
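A minimal sketch (in Python) of Steps 2-4, assuming the KB exposes a plain string-to-score lookup; the toy dictionary, drop list, and scores below are hypothetical.

# Step 2 dictionary: every surface string the KB knows, with a quick score.
KB = {"stephen": 7, "lets": 2, "watch": 7, "politics": 6,
      "politics of love": 9, "is": 1, "about": 1, "obama": 10, "election": 6}
DROP = {"lets", "is", "about"}  # Step 3: obvious bad cases

def candidate_mentions(text, max_len=4):
    tokens = text.lower().replace(".", "").replace("'s", "").split()
    found = {}
    for i in range(len(tokens)):                  # Step 2: all n-grams in the KB
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            gram = " ".join(tokens[i:j])
            if gram in KB and gram not in DROP:   # Step 3 folded into the scan
                found[gram] = KB[gram]            # Step 4: quick initial score
    return sorted(found.items(), key=lambda kv: -kv[1])

print(candidate_mentions(
    "Stephen lets watch. Politics of Love is about Obama's election"))
# [('obama', 10), ('politics of love', 9), ('stephen', 7), ('watch', 7), ...]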
How: Add mention features
Step 5: Tag and Classify – Quick and dirty
“Obama”: Presidents, Politicians, People; Politics, Places, Geography
“Politics of Love”: Movies, Political Movies, Entertainment, Politics
“Stephen”: Names, People
“watch”: Verb, English Words, Language, Fashion Accessories, Clothing
“Politics”: Politics
“Election”: Political Events, Politics, Government
Tweet: Politics, People, Movies, Entertainment, etc.
Step 6: Add features
Contexts, similarity to the tweet, similarity to the user or website, popularity
measures, interestingness, social signals (one such feature is sketched below)
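One such Step-6 feature, sketched minimally: cosine similarity between the tweet's tag vector and a context's topic vector. The vectors and weights below are hypothetical.

import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tweet_tags = Counter({"Politics": 3, "Movies": 2, "People": 1})
user_context = Counter({"Politics": 5, "Travel": 1})

print(round(cosine(tweet_tags, user_context), 3))  # one feature for scoring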
How: Finalize mentions
Step 7: Apply Rules
“Obama”: Boost popular entities and proper nouns
“Politics of Love”: Boost proper nouns; boost because it follows “watch”
“Stephen”: Delete out-of-context names
“watch”: Remove verbs
“Politics”: Boost tags that are also mentions
“Election”: Boost mentions in the central topic
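A minimal sketch of Step 7, assuming each rule is a (predicate, score-delta) pair applied to a candidate's feature dict; the rules and weights below are hypothetical, not the system's actual rule language.

RULES = [
    (lambda m: m["is_proper_noun"], +3),                     # boost proper nouns
    (lambda m: m["is_verb"], -10),                           # remove verbs
    (lambda m: m["popularity"] > 8, +2),                     # boost popular entities
    (lambda m: m["is_name"] and not m["in_context"], -10),   # out-of-context names
]

def apply_rules(mention):
    for predicate, delta in RULES:
        if predicate(mention):
            mention["score"] += delta
    return mention

m = {"text": "watch", "score": 7, "is_proper_noun": False,
     "is_verb": True, "popularity": 5, "is_name": False, "in_context": True}
print(apply_rules(m)["score"])  # -3: "watch" is dropped as a verb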
Step 8: Disambiguate
KB has many meanings – Pick One
Obama: Barack Obama. Chosen by popularity, context, and social popularity
Watch: the verb. Clothing is not in context
Context is the most important signal; we combine many contexts for the best results.
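A minimal sketch of Step 8, assuming each surface string maps to several KB senses with a popularity prior, and the winner maximizes popularity plus context overlap; the senses and weights below are hypothetical.

SENSES = {"obama": [
    {"node": "Barack Obama", "popularity": 0.9, "topics": {"Politics", "People"}},
    {"node": "Obama, Japan", "popularity": 0.1, "topics": {"Places", "Japan"}},
]}

def disambiguate(text, context_topics, alpha=0.5):
    # Pick the sense maximizing a mix of popularity and context overlap.
    def score(sense):
        overlap = len(sense["topics"] & context_topics) / max(len(sense["topics"]), 1)
        return alpha * sense["popularity"] + (1 - alpha) * overlap
    return max(SENSES[text], key=score)["node"]

print(disambiguate("obama", {"Politics", "Movies"}))  # Barack Obama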
How: Finalize
Step 9: Rescore
Logistic Regression model on all the features
Step 10: Re-tag
Use the latest scores and only the picked meanings
Step 11: Editorial Rules
A regular-expression-like language for analysts to pick/block entities
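A minimal sketch of the Step 9 rescoring, using scikit-learn's LogisticRegression as a stand-in for the model; the feature layout and training labels below are hypothetical.

from sklearn.linear_model import LogisticRegression

# Features per candidate: [initial_score, context_similarity, popularity, is_proper_noun]
X_train = [[10, 0.8, 0.9, 1], [7, 0.1, 0.3, 0], [9, 0.7, 0.5, 1], [6, 0.2, 0.4, 0]]
y_train = [1, 0, 1, 0]  # 1 = analyst-labeled true mention

model = LogisticRegression().fit(X_train, y_train)

# The probability of being a real mention replaces the heuristic score.
print(model.predict_proba([[8, 0.6, 0.7, 1]])[0, 1])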
Does it work? – Evaluation of Entity Extraction
• For 500 English tweets, we hand-curated a list of mentions.
• For 99 of those, we built a comprehensive list of tags.
• Entity extraction:
• Works well for people, organizations,
locations
• Works great for unique names
• Works poorly for media: albums, songs
• The generic-names problem:
• Too many movies, books, albums, and
songs have generic names
• “Inception”, “It’s Friday”, etc.
• Even when popular, they are often used
“in conversation”
• Very hard to disambiguate.
• Very hard to find which ones are generic.
Does it work? – Evaluation of Tagging
• Tagging/Classification:
• Works well for Travel/Sports
• Poor for Products and Social
Sciences
• N-lineages problem (see the
sketch after this list):
• All mentions have multiple
lineages in the KB.
• Usually, one IsA lineage goes
to “People” or “Product”.
• A ContainedIn lineage goes to
a topic like “Social Science”.
• Detecting which is primary is a
hard problem.
• Is Camera in Photography? Or
Electronics?
• Is War History? Or Politics?
• How far do we go?
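A minimal sketch of the N-lineages problem, enumerating every root path for a node over mixed IsA/ContainedIn edges; the toy graph below is hypothetical.

PARENTS = {  # node -> list of (edge_type, parent)
    "Camera": [("IsA", "Electronics"), ("ContainedIn", "Photography")],
    "Electronics": [("IsA", "Products")],
    "Photography": [("ContainedIn", "Arts")],
}

def lineages(node, path=()):
    edges = PARENTS.get(node, [])
    if not edges:
        yield path + (node,)
    for _, parent in edges:
        yield from lineages(parent, path + (node,))

for lineage in lineages("Camera"):
    print(" -> ".join(lineage))
# Camera -> Electronics -> Products
# Camera -> Photography -> Arts

Deciding which lineage is primary (here, Electronics vs. Photography) is exactly the hard problem the slide describes.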
Comparison with existing systems
• The first such comparison effort that we know of.
• OpenCalais
– An industrial entity-extraction system
• StanNER-3 (from Stanford)
– A 3-class (Person, Organization, Location) named-entity recognizer
using a CRF-based model trained on a mixture of the CoNLL, MUC,
and ACE named-entity corpora.
• StanNER-3-cl (from Stanford)
– The caseless version of StanNER-3; it ignores capitalization in text.
• StanNER-4 (from Stanford)
– A 4-class (Person, Organization, Location, Misc) named-entity
recognizer for English text, using a CRF-based model trained on the
CoNLL corpora.
For People, Organization, Location
• Details in the Paper.
• We are far better in almost all respects:
– Overall: 85% Precision vs 78% best in other systems.
– Overall: 68% Recall vs 40% for StanNER-3 and 28% for OpenCalais
– Significantly better on Organizations
• Why? - Bigger Knowledge Base
– The larger knowledge base allows more comprehensive disambiguation.
– Is “Emilie Sloan” referring to a person or organization?
• Why? - Common interjections
– LOL, ROFL, Haha interpreted as organizations by other systems.
– Acronyms misinterpreted
• Vs. OpenCalais
– Recall is the major difference: OpenCalais recognizes a significantly
smaller set of entities.
Q&A