Social Media, Data Integration, and Human Computation

AnHai Doan
University of Wisconsin
@WalmartLabs
A Journey Starting in 2001 ...

Worked in data integration
– combine multiple data sources into one
– e.g., aggregation/comparison shopping sites, Google Scholar
– example query: find houses with 2 bedrooms under 400K, answered over homes.com, realestate.com, fsbo.com
– use schema matching, information extraction, entity disambiguation

Ph.D. thesis focused on schema matching
Schema Matching


Source 1:
address | price
31 Bagley Ct ... | 250K
12 Hope St ... | 375K
Source 2:
location | sold-at
14 Main St ... | 249,000
25 West St ... | 324,000
Matches: address = location, price = sold-at
Developed automatic solution using machine learning
Realized that automatic solutions are not good enough
– only 65-85% accuracy
– need human intervention

Proposed a crowdsourcing approach
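
To make the matching task concrete, here is a minimal sketch of an instance-based matcher run on the sample values from the slide (with the trailing "..." dropped). It is not the machine learning solution mentioned above; the two features (share of letters, number of tokens) and the function names `column_features` and `match_schemas` are assumptions chosen only for illustration.

```python
def column_features(values):
    """Return (average share of letters, average token count) for a column."""
    letter_share = sum(sum(c.isalpha() for c in v) / len(v) for v in values) / len(values)
    token_count = sum(len(v.split()) for v in values) / len(values)
    return letter_share, token_count


def match_schemas(source, target):
    """Map each source column to the target column with the closest features."""
    result = {}
    for s_name, s_values in source.items():
        s_feat = column_features(s_values)

        def distance(t_name):
            t_feat = column_features(target[t_name])
            return sum(abs(a - b) for a, b in zip(s_feat, t_feat))

        result[s_name] = min(target, key=distance)
    return result


# Sample values taken from the slide; everything else is hypothetical.
source = {"address": ["31 Bagley Ct", "12 Hope St"], "price": ["250K", "375K"]}
target = {"location": ["14 Main St", "25 West St"], "sold-at": ["249,000", "324,000"]}
matches = match_schemas(source, target)
print(matches)
```

A heuristic like this fails as soon as two columns look alike (say, two price columns), which is in line with the point above that automatic matching alone is not accurate enough.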
Crowdsourced Schema Matching
Source 1:
address | price
31 Bagley Ct ... | 250K
12 Hope St ... | 375K
Source 2:
location | sold-at
14 Main St ... | 249,000
25 West St ... | 324,000
Question put to users: address = location? Answers: Yes, Yes, No
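
One simple way to combine the user answers shown on this slide is a majority vote. The sketch below assumes that aggregation rule; the slides do not say how the answers were actually combined, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer (lowercased), or None if empty or tied."""
    counts = Counter(a.strip().lower() for a in answers)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# The question and answers come from the slide.
votes = {"address = location": ["Yes", "Yes", "No"]}
decisions = {question: majority_vote(answers) for question, answers in votes.items()}
print(decisions)
```

Returning None on a tie leaves room to ask more users before accepting or rejecting a match.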


Can crowdsource other DI tasks too
Difficult to publish
Build a large-scale DI system on the Web
– Building data integration systems via mass collaboration, WebDB-03
– Subsequent reviews: great work, I don’t believe it, neutral
Show that crowdsourcing is practical
Started DBLife Project in 2005
Data sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages
Extracted entities and relationships: e.g., HV Jagadish give-talk SIGMOD-07
Services: superpages, keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary
Infrastructure: file system, RDBMS, Hadoop
Example Superpage
Example Crowdsourcing
Picture is removed if enough users vote “no”.
Project Status in 2009

Data integration
– overall methodology: VLDB-07a, VLDB-07b, CIDR-09
– DI operators: VLDB-07c
– optimization: VLDB-07c, SIGMOD-08, ICDE-08a, SIGMOD-09a
– provenance/others: ICDE-07a, ICDE-07b, VLDB-08a

Crowdsourcing / human computation
– schema matching: ICDE-08b
– best-effort information extraction: SIGMOD-08
– human feedback into the DI pipeline: SIGMOD-09b
– how lay users can query the database: SIGMOD-09c

System development
– hard to build/maintain systems in academia
Wanted to know what’s going on in industry
Wanted to take DBLife to the next level
Joined Kosmix in 2010 to do “DBLife on steroids”
Kosmix

Founded by Anand Rajaraman & Venky Harinarayan
– formerly of Junglee, sold to Amazon for 250M


55M in funding, 30+ engineers
Integrated Web data sources into a giant taxonomy
Sources: IMDB, Musicbrainz, Tripadvisor, Wikipedia, …
Operations: information extraction, entity disambiguation, entity merging, ...
Taxonomy: all > places, people > actors > Angelina Jolie, Mel Gibson (each node has a topic page)
Infrastructure: file system, RDBMS, Hadoop
Raised many interesting challenges
– e.g., incremental updates, recycling human edits
Very good in certain topics (e.g., health)
But hard to compete with Google and Wikipedia
Switched to social media in early 2010
Social Media Exploding
• 100 million tweets per day
• 1 billion Facebook shares per day
• 1.5 million Foursquare checkins per day
• 40,000 Flickr photos per second
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.” -- Eric Schmidt
Switching Made Much Business Sense


Lot of social media data
Lot of people using it, spending a lot of time on it
– lot of links now come from social media, not search engines
– Google is worried (hence Buzz, Google+, Google++)

New level playing field
→ Have a secret weapon: the giant taxonomy
→ Next hot Internet wave
– SoLoMo = social + local + mobile

But can we build interesting applications?
What is social media good for?
From Frivolous to Serious

95% of tweets are still junk
– I feel good today

Help teenagers track Justin Bieber
– the background noise of Twitter


Charlie Sheen, celebrity fighting, Weiner losing his job
Foster customer relationships
– follow your dentist




Spread news
Manage disasters
Promote e-commerce
Help organize events, movements
– revolutions
Lot of Companies / Actions in This Space

Build platforms for social media
– how to tweet more effectively

Understand social media
– social analytics / route relevant information to users


Use social media to make predictions
Use social media to affect real-world changes

Mostly operate at the keyword level
– how many times the keyword “Obama” has been mentioned today?

Kosmix: the leader in performing semantic analysis
– how many times the entity President Obama has been mentioned today?
– “Obama”, “Barack”, “Barry”, “BO”, “the Pres”, “the Messiah”, ...
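
The difference between keyword-level and entity-level counting can be sketched as follows. The alias list comes from the slide; the tweets, the `ALIASES` table, and both functions are hypothetical, and plain alias matching is only a stand-in for real semantic analysis.

```python
import re

# Aliases listed on the slide for the entity "President Obama".
ALIASES = {
    "President Obama": ["Obama", "Barack", "Barry", "BO", "the Pres", "the Messiah"],
}


def _word_pattern(phrase):
    """Compile a case-insensitive whole-word pattern for a phrase."""
    return re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)


def keyword_count(tweets, keyword):
    """Count tweets that mention one literal keyword."""
    pattern = _word_pattern(keyword)
    return sum(1 for t in tweets if pattern.search(t))


def entity_count(tweets, entity, aliases=ALIASES):
    """Count tweets that mention any alias of the entity."""
    patterns = [_word_pattern(a) for a in aliases[entity]]
    return sum(1 for t in tweets if any(p.search(t) for p in patterns))


# Hypothetical tweets for illustration.
tweets = [
    "Obama speaks tonight",
    "Barack is on TV",
    "the Pres looks tired",
    "I feel good today",
]
print(keyword_count(tweets, "Obama"), entity_count(tweets, "President Obama"))
```

Alias matching alone over-counts ambiguous names such as "Barry" or "BO", which is why entity disambiguation is still needed on top of it.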
Kosmix Solution
Crowdsourcing: internal analysts, users, Mechanical Turks, others
Sources: IMDB, Musicbrainz, Wikipedia, …
Operations: information extraction, entity disambiguation, entity merging, schema matching, event detection, event monitoring, ...
Output: Social Genome, feeding applications
Highly scalable real-time infrastructure: file system, RDBMS, Hadoop, Muppet (slates, stream servers)
Social Genome
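Taxonomy: all > places, people, events
– people: Twitter users (@melgibson, @dsmith, …), FB users (mel-gibson, davesmith, …), actors (Angelina Jolie, Mel Gibson)
– events: sports, celebrities (Gibson car crash), politics (Egyptian uprising), …
– places: Egypt, Cairo, Tahrir
Relationships: the-same-as, tweet-about, related-to, capital-of (Cairo, Egypt), located-in (Tahrir, Cairo)
Sample tweets attached to events:
– “@dsmith: Mel crashed. Maserati is gone.” (Gibson car crash)
– “@far213: Tahrir is packed!” (Egyptian uprising)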
Building Social Genome: Three Sample Challenges
Extraction and Disambiguation:
Traditional Methods Ill Suited for Social Media
Taxonomy: all > places; events (sports, celebrities: Gibson car crash, politics: Egyptian uprising, …); people (actors: Angelina Jolie, Mel Gibson; directors: Mel Brooks)
Traditional text: “Mel was arrested again. What a dramatic fall since his Oscar-winning day.”
– Extraction: use rule-based / NLP / machine learning techniques
– Disambiguation: long-term, Web context: actor, movie, Oscar, Hollywood
Social media: “@dsmith: mel crashed. maserati is gone.”
– Extraction: use dictionaries, use rules
– Disambiguation: short-term, social context: crash, car, Maserati
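
The idea of disambiguating a mention against both kinds of context can be sketched as a simple word-overlap score. The context words for Mel Gibson are the ones on the slide; the context for Mel Brooks, the `CONTEXTS` table, and the scoring rule are assumptions made for illustration only.

```python
import re

CONTEXTS = {
    "Mel Gibson": {
        "web": {"actor", "movie", "oscar", "hollywood"},  # long-term, from the slide
        "social": {"crash", "car", "maserati"},           # short-term, from the slide
    },
    "Mel Brooks": {
        "web": {"director", "comedy", "producer"},        # invented for the example
        "social": set(),
    },
}


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(text, contexts=CONTEXTS):
    """Pick the candidate whose combined context overlaps the text most."""
    words = tokens(text)
    scores = {
        entity: len(words & (ctx["web"] | ctx["social"]))
        for entity, ctx in contexts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("@dsmith: mel crashed. maserati is gone."))
print(disambiguate("Mel was arrested again. What a dramatic fall since his Oscar-winning day."))
```

In this sketch the tweet is resolved only because "maserati" sits in the short-term social context; the long-term Web context alone gives it a score of zero.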
Must Maintain a Highly Dynamic Social Genome
Taxonomy: all > places; events (sports, celebrities: Gibson car crash, politics: Egyptian uprising, …); people (actors: Angelina Jolie, Mel Gibson; directors: Mel Brooks)
Long-term, Web context: actor, movie, Oscar, Hollywood
Short-term, social context: crash, car, Maserati
Latency less than 2 seconds
The Giant Traditional Taxonomy is the Secret Weapon
Taxonomy: all > places (Egypt, Cairo, Tahrir), people > actors (Angelina Jolie, Mel Gibson)
Relationships: capital-of (Cairo, Egypt), located-in (Tahrir, Cairo)

Without it, dictionary-based extraction is not possible
Provides a framework to
– “understand” social media, find related concepts, “hang” social contexts

Very hard to develop, takes years
– like learning a new foreign language

Partly explains why it was hard for others to catch up
→ Must integrate traditional data well, then bootstrap
Event Detection: Current Solutions
Sources: Twitter, 4square, Facebook, Myspace, Flickr, …
Event detection produces events: sports, celebrities (Gibson car crash), politics (Egyptian uprising), …
• Focus on Twitter + Foursquare
• Lot of current work in academia / industry
• Limitations of most of the current solutions
– exploit just one kind of heuristic
• e.g., find popular, strongly correlated words (Egypt, revolt)
– do not exploit crowdsourcing
– do not scale
• not designed explicitly for parallelism
Event Detection: Kosmix Solution
Twitter, Foursquare → Detector 1, Detector 2, …, Detector n → candidate events → event evaluator and ranker → ranked events
Population 1, Population 2, Population 3, ...
Infrastructure: Hadoop, Muppet (slates, stream servers)
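
The multi-detector pipeline on this slide can be sketched roughly as below. The two detectors, their thresholds, and the additive ranking are all assumptions; the slide does not describe the actual detectors or evaluator, and the crowd populations and the Hadoop/Muppet infrastructure are not modeled here.

```python
import re
from collections import Counter


def frequent_word_detector(tweets, min_count=2):
    """Candidate events: longer words that appear in at least min_count tweets."""
    counts = Counter(
        w for t in tweets for w in set(re.findall(r"[a-z]+", t.lower()))
    )
    return {w: c for w, c in counts.items() if c >= min_count and len(w) > 3}


def hashtag_detector(tweets, min_count=2):
    """Candidate events: hashtags that appear in at least min_count tweets."""
    counts = Counter(
        h.lower() for t in tweets for h in set(re.findall(r"#(\w+)", t))
    )
    return {h: c for h, c in counts.items() if c >= min_count}


def rank_events(tweets, detectors):
    """Sum candidate scores across detectors and rank by total score."""
    combined = Counter()
    for detect in detectors:
        combined.update(detect(tweets))
    return [event for event, _score in combined.most_common()]


# Hypothetical tweets; only "Tahrir is packed!" appears in the slides.
tweets = ["Tahrir is packed! #egypt", "Protest in Tahrir #egypt", "I feel good today"]
ranked = rank_events(tweets, [frequent_word_detector, hashtag_detector])
print(ranked)
```

Because the detectors are independent functions over the same input, each one could run in parallel, which is the property the previous slide says most current solutions lack.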
Event Monitoring: Current Solutions
Egyptian uprising
@far213: Tahrir is packed!
Baltimore shooting
@dsmith: Baltimore shooting on TV5!
• Manually write rules to match tweets to events
– e.g., tweet contains certain keywords / userids → positive
– conceptually simple, relatively easy to implement
– often achieve high initial precision
• Limitations
– expensive, don’t scale
– manually writing good rules can be hard
– rules often become invalid/inadequate over time
• e.g., Baltimore shooting → Johns Hopkins shooting
Event Monitoring: Kosmix Solution
Event: Baltimore shooting
Initial profile: {Baltimore, shoot}
Twitter firehose → matching tweets:
– “Baltimore shooting on TV5!”
– “Baltimore shooting. Johns Hopkins shut down.”
– ...
Learning algorithm → new profile: {Baltimore, shoot, Johns Hopkins}
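
A minimal sketch of how an event profile might grow from matching tweets: terms that co-occur often enough in tweets matching the current profile are added to it. The slides do not describe the actual learning algorithm, so the co-occurrence rule, the `min_support` threshold, and the third and fourth sample tweets are assumptions.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(profile, tweet):
    """A tweet matches if every profile term is a prefix of some word in it."""
    words = tokens(tweet)
    return all(any(w.startswith(term) for w in words) for term in profile)


def update_profile(profile, tweets, min_support=2):
    """Add words that occur in at least min_support matching tweets."""
    matched = [tokens(t) for t in tweets if matches(profile, t)]
    counts = Counter(
        w
        for words in matched
        for w in words
        if not any(w.startswith(term) for term in profile)
    )
    return set(profile) | {w for w, c in counts.items() if c >= min_support}


tweets = [
    "Baltimore shooting on TV5!",
    "Baltimore shooting. Johns Hopkins shut down.",
    "Shooting in Baltimore near Johns Hopkins",  # hypothetical
    "I feel good today",                         # hypothetical
]
new_profile = update_profile({"baltimore", "shoot"}, tweets)
print(sorted(new_profile))
```

A real system would also need to drop stale terms and guard against spam, which this sketch ignores.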
Social Analytics with The NYTimes
Tweets → Annotators (e.g., location, sentiment, entity extraction, etc.) → Tweets & Dimensions → SocialCubes → Stats
Dimensions:
– Location
– Topics: Barack Obama, Hillary Clinton, Medicare
– Sentiment: negative, positive, neutral
Entity aliases: Barack Obama, President Obama, the Pres, Barry, BO, ...
Sample queries:
– How many are tweeting about Barack Obama in New York, by the minute for the last 60 mins, by hour for the last 24 hours, and by day for the last 10 days?
– How many feel negative about Barack Obama across the US?
– How many people in Arizona feel positive about the new Medicare plan?
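
The sample queries above amount to counting annotated tweets along a few dimensions. A minimal sketch, assuming each tweet has already been annotated with one entity, one location, and one sentiment; the record layout and the names `build_cube`, `query`, and `per_minute` are hypothetical and say nothing about how SocialCubes is actually implemented.

```python
from collections import Counter
from datetime import datetime


def build_cube(annotated_tweets):
    """Count tweets per (entity, location, sentiment, minute) cell."""
    cube = Counter()
    for t in annotated_tweets:
        minute = t["time"].replace(second=0, microsecond=0)
        cube[(t["entity"], t["location"], t["sentiment"], minute)] += 1
    return cube


def query(cube, entity=None, location=None, sentiment=None):
    """Sum the cells that match every dimension that was specified."""
    total = 0
    for (e, loc, s, _minute), n in cube.items():
        if entity in (None, e) and location in (None, loc) and sentiment in (None, s):
            total += n
    return total


def per_minute(cube, entity):
    """Return a minute-by-minute count series for one entity."""
    series = Counter()
    for (e, _loc, _s, minute), n in cube.items():
        if e == entity:
            series[minute] += n
    return series


# Hypothetical annotated tweets.
annotated = [
    {"entity": "Barack Obama", "location": "New York", "sentiment": "negative",
     "time": datetime(2011, 5, 1, 12, 0, 30)},
    {"entity": "Barack Obama", "location": "New York", "sentiment": "positive",
     "time": datetime(2011, 5, 1, 12, 0, 45)},
    {"entity": "Medicare", "location": "Arizona", "sentiment": "positive",
     "time": datetime(2011, 5, 1, 12, 1, 5)},
]
cube = build_cube(annotated)
print(query(cube, entity="Barack Obama", location="New York"))
```

Hourly and daily series follow the same pattern with a coarser time bucket.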
Social Monitoring with an Unknown Agency
Twitter firehose → monitored topics: Justin Bieber, Charlie Sheen, Egyptian uprising, Jordan unrest, China unrest (North, Tibet, West, Southeast)
Count tweets related to Wael Ghonim: 146 in past 5 mins, 3267 in past 12 hours
Bought by Walmart in May 2011
The Walmart Acquisition


Deal reported to be 250-300M
Kosmix became @WalmartLabs
– based in San Bruno
– local office in India
– plan new offices in China and Brazil

100 persons today, actively hiring
Why?


400+ B in revenue, only 5-10B online vs. 34B of Amazon
Major problems if it doesn’t catch up within 5-10 years
– see Borders

@WalmartLabs can help in many ways
– Provides a core of technical people, attracts more
– Improves traditional e-commerce
  – SEO, SEM, search on walmart.com
  – build a vast product taxonomy
– Helps build the e-commerce of the future
  – social, local, and mobile
  – a good way to catch up and leapfrog Amazon
Improve Traditional E-Commerce
Sources: product data from thousands of vendors, in-house data, Web data
Operations: information extraction, entity disambiguation, entity merging, ...
Product taxonomy: all products > books, cars > US cars > Ford, Chevrolet
Applications: search, ads
Infrastructure: file system, RDBMS, Hadoop
Help Build the E-Commerce of the Future: Social, Local, and Mobile

O2O (Online 2 Offline) emerging as a major trend
– increasingly tighter integration of online and offline parts
– e.g., Groupon, Living Social

Social, local, and mobile commerce examples
– gift recommendation:
  – “I love salt!”
  – “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”
– personalized “Groupon” with vendors:
  – “You seem to be interested in gourmet coffee. If 50 persons sign up to buy the new DeLonghi coffee maker, you can get that for a 50% discount.”
– stocking a local store
– a Siri-like shopping assistant
Wrapping Up


Social media has become a major frontier on the Web
Integrating social data is fundamentally much harder than integrating “traditional” data
– lack of context
– dynamic environment, new concepts appear quickly
– quality issues, lots of spam
– quick spread of information, user activities
– fast data
– solution will change over time, need human in the loop to monitor

Must integrate “traditional” data well, then bootstrap
– giant taxonomy critical

Crowdsourcing becomes indispensable
– but raises interesting challenges