Social Media, Data Integration, and Human Computation
AnHai Doan
University of Wisconsin and @WalmartLabs

A Journey Starting in 2001 ...
Worked in data integration
– combine multiple data sources into one
– e.g., aggregation/comparison shopping sites, Google Scholar
– use schema matching, information extraction, entity disambiguation
[Figure: the query "Find houses with 2 bedrooms under 400K" posed over homes.com, realestate.com, and fsbo.com]
Ph.D. thesis focused on schema matching

Schema Matching
Source 1:
  address             price
  31 Bagley Ct ...    250K
  12 Hope St ...      375K
Source 2:
  location            sold-at
  14 Main St ...      249,000
  25 West St ...      324,000
Matches: address = location, price = sold-at
Developed automatic solution using machine learning
Realized that automatic solutions are not good enough
– only 65-85% accuracy
– need human intervention
Proposed a crowdsourcing approach

Crowdsourced Schema Matching
[Figure: the same two tables; crowd answers to "address = location": Yes, Yes, No]
Can crowdsource other DI tasks too
Difficult to publish
– Building data integration systems via mass collaboration, WebDB-03
– Subsequent reviews: great work, I don’t believe it, neutral
Show that crowdsourcing is practical
– Build a large-scale DI system on the Web

Started DBLife Project in 2005
[Figure: DBLife architecture. Sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, other Web pages. Extracted entities and relations, e.g., HV Jagadish give-talk SIGMOD-07. Services: superpages, keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Storage: file system, RDBMS, Hadoop]

Example Superpage

Example Crowdsourcing
Picture is removed if enough users vote “no”.
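
The two crowdsourcing examples above (the Yes, Yes, No answers on "address = location", and removing a picture once enough users vote “no”) can be illustrated with a small vote-aggregation sketch. This is only an assumption about how such votes might be combined; the talk does not specify the aggregation rule, and the function names and the `min_no_votes` threshold are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer, or None if there is no clear winner."""
    ranked = Counter(answers).most_common()
    if not ranked:
        return None
    # A tie between the top two answers means the crowd has not decided.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def should_remove(votes, min_no_votes=3):
    """Remove an item once at least min_no_votes users voted "no".

    The threshold of 3 is an illustrative assumption, not a value from the talk.
    """
    return sum(1 for v in votes if v == "no") >= min_no_votes


# Crowd answers to the candidate match "address = location".
print(majority_vote(["yes", "yes", "no"]))
# Votes on whether a picture belongs on a superpage.
print(should_remove(["no", "yes", "no", "no"]))
```

A simple majority is the easiest rule to reason about; a real system would likely also weight users by reliability, which this sketch leaves out.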

Project Status in 2009
Data integration
– overall methodology: VLDB-07a, VLDB-07b, CIDR-09
– DI operators: VLDB-07c
– optimization: VLDB-07c, SIGMOD-08, ICDE-08a, SIGMOD-09a
– provenance/others: ICDE-07a, ICDE-07b, VLDB-08a
Crowdsourcing / human computation
– schema matching: ICDE-08b
– best-effort information extraction: SIGMOD-08
– human feedback into the DI pipeline: SIGMOD-09b
– how lay users can query the database: SIGMOD-09c
System development
Wanted to know what’s going on in industry
– hard to build/maintain systems in academia
Wanted to take DBLife to the next level
Joined Kosmix in 2010 to do “DBLife on steroids”

Kosmix
Founded by Anand Rajaraman & Venky Harinarayan
– formerly of Junglee, sold to Amazon for 250M
55M in funding, 30+ engineers
Integrated Web data sources into a giant taxonomy
[Figure: Web sources (IMDB, Musicbrainz, Tripadvisor, Wikipedia, …) pass through information extraction, entity disambiguation, entity merging, ... into a taxonomy (all > places, people > actors > Angelina Jolie, Mel Gibson) with topic pages; storage: file system, RDBMS, Hadoop]
Raised many interesting challenges
– e.g., incremental updates, recycling human edits
Very good in certain topics (e.g., health)
But hard to compete with Google and Wikipedia
Switched to social media in early 2010

Social Media Exploding
• 100 million tweets per day
• 1 billion Facebook shares per day
• 1.5 million Foursquare checkins per day
• 40,000 Flickr photos per second
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.” -- Eric Schmidt

Switching Made Much Business Sense
Lot of social media data
Lot of people using it, spending a lot of time on it
– lot of links now come from social media, not search engines
– Google is worried (hence Buzz, Google+, Google++)
New level playing field
Have a secret weapon: the giant taxonomy
Next hot Internet wave
– SoLoMo = social + local + mobile
But can we build interesting applications? What is social media good for?

From Frivolous to Serious
95% of tweets are still junk
– I feel good today
Help teenagers track Justin Bieber
– the background noise of Twitter
– Charlie Sheen, celebrity fighting, Weiner losing his job
Foster customer relationships
– follow your dentist
Spread news
Manage disasters
Promote e-commerce
Help organize events, movements
– revolutions

Lot of Companies / Actions in This Space
Build platforms for social media
– how to tweet more effectively
Understand social media
– social analytics / route relevant information to users
Use social media to make predictions
Use social media to effect real-world changes
Mostly operate at the keyword level
– how many times has the keyword “Obama” been mentioned today?
Kosmix: the leader in performing semantic analysis
– how many times has the entity President Obama been mentioned today?
– “Obama”, “Barack”, “Barry”, “BO”, “the Pres”, “the Messiah”, ...

Kosmix Solution
[Figure: Kosmix architecture. Sources: IMDB, Musicbrainz, Wikipedia, … Processing: information extraction, entity disambiguation, entity merging, schema matching, event detection, event monitoring, ... Crowdsourcing: internal analysts, users, Mechanical Turks, others. Output: Social Genome, feeding applications. Highly scalable real-time infrastructure: file system, RDBMS, Hadoop, Muppet, Slates, stream servers]

Social Genome
[Figure: the Social Genome graph. Taxonomy nodes: all; places (Egypt, Cairo, Tahrir); people > actors (Angelina Jolie, Mel Gibson); events > sports, celebrities (Gibson car crash), politics (Egyptian uprising), … Social accounts: Twitter users (@melgibson, @dsmith, …), FB users (mel-gibson, davesmith, …). Edge labels: tweet-about, the-same-as, capital-of, related-to, located-in. Sample tweets: “@dsmith: Mel crashed. Maserati is gone.” and “@far213: Tahrir is packed!”]

Building Social Genome: Three Sample Challenges
[Figure: the Social Genome graph, repeated]
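
The keyword-level versus entity-level distinction drawn above can be made concrete with a minimal sketch. It assumes a hand-written alias dictionary built from the aliases listed on the slide; the function names and most of the sample tweets are hypothetical, and the sketch deliberately skips disambiguation (an alias such as "BO" or "Barry" can refer to someone else), which the talk treats as a separate problem.

```python
import re

# Hypothetical alias dictionary; the aliases themselves come from the slide.
ALIASES = {
    "Barack Obama": ["obama", "barack", "barry", "bo", "the pres", "the messiah"],
}


def keyword_count(tweets, keyword):
    """Count tweets containing the literal keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(1 for t in tweets if pattern.search(t))


def entity_count(tweets, entity, aliases=ALIASES):
    """Count tweets mentioning any known alias of the entity (no disambiguation)."""
    alternation = "|".join(re.escape(a) for a in aliases[entity])
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return sum(1 for t in tweets if pattern.search(t))


# Illustrative tweets, not data from the talk.
tweets = [
    "Obama speaks at noon",
    "Barack is on TV right now",
    "the Pres just landed in Cairo",
    "I feel good today",
]

print(keyword_count(tweets, "Obama"))        # keyword level
print(entity_count(tweets, "Barack Obama"))  # entity level
```

On these four tweets the keyword count sees one mention while the entity count sees three, which is the gap the slide is pointing at.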

Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media
[Figure: taxonomy fragment (all > places; people > actors, directors; events > sports, celebrities, politics, …) with entities Angelina Jolie, Mel Gibson, Mel Brooks and events Gibson car crash, Egyptian uprising]
Traditional text: “Mel was arrested again. What a dramatic fall since his Oscar-winning day.”
– Extraction: use rule-based / NLP / machine learning techniques
– Disambiguation: long-term, Web context: actor, movie, Oscar, Hollywood
Social media: “@dsmith: mel crashed. maserati is gone.”
– Extraction: use dictionaries, use rules
– Disambiguation: short-term, social context: crash, car, Maserati

Must Maintain a Highly Dynamic Social Genome
[Figure: the same taxonomy fragment]
Short-term, social context: crash, car, Maserati
Long-term, Web context: actor, movie, Oscar, Hollywood
Latency less than 2 seconds

The Giant Traditional Taxonomy is the Secret Weapon
[Figure: taxonomy fragment: all > places (Egypt; Cairo capital-of Egypt; Tahrir located-in Cairo), people > actors (Angelina Jolie, Mel Gibson)]
Without it, dictionary-based extraction is not possible
Provide a framework to
– “understand” social media, find related concepts, “hang” social contexts
Very hard to develop, takes years
– like learning a new foreign language
Partly explains why it was hard for others to catch up
Must integrate traditional data well, then bootstrap

Event Detection: Current Solutions
[Figure: Twitter, 4square, Facebook, Myspace, Flickr, … -> event detection -> events (sports; celebrities: Gibson car crash; politics: Egyptian uprising; …)]
• Focus on Twitter + Foursquare
• Lot of current work in academia / industry
• Limitations of most of the current solutions
  – exploit just one kind of heuristic
    • e.g., find popular, strongly correlated words (Egypt, revolt)
  – do not exploit crowdsourcing
  – do not scale
    • not designed explicitly for parallelism

Event Detection: Kosmix Solution
[Figure: Twitter, Foursquare -> Detector 1, Detector 2, …, Detector n -> candidate events -> event evaluator and ranker -> ranked events. Crowd: Population 1, Population 2, Population 3, ... Infrastructure: Hadoop, Muppet, Slates, stream servers]

Event Monitoring: Current Solutions
Egyptian uprising: “@far213: Tahrir is packed!”
Baltimore shooting: “@dsmith: Baltimore shooting on TV5!”
• Manually write rules to match tweets to events
  – e.g., tweet contains certain keywords / userids -> positive
  – conceptually simple, relatively easy to implement
  – often achieve high initial precision
• Limitations
  – expensive, don’t scale
  – manually writing good rules can be hard
  – rules often become invalid/inadequate over time
    • e.g., Baltimore shooting -> Johns Hopkins shooting

Event Monitoring: Kosmix Solution
Event: Baltimore shooting
Initial profile: {Baltimore, shoot}
Tweets from the Twitter firehose:
– “Baltimore shooting on TV5!”
– “Baltimore shooting. Johns Hopkins shut down.”
– ...
Learning algorithm
New profile: {Baltimore, shoot, Johns Hopkins}

Social Analytics with The NYTimes
[Figure: Tweets -> annotators (e.g., location, sentiment, entity extraction, etc.) -> tweets & dimensions -> SocialCubes -> stats]
Dimensions
– Location
– Topics: Barack Obama, Hillary Clinton, Medicare
– Sentiment: negative, positive, neutral
Entity aliases: Barack Obama, President Obama, the Pres, Barry, BO, ...
Sample queries
– How many are tweeting about Barack Obama in New York, by the minute for the last 60 mins, by hour for the last 24 hours, and by day for the last 10 days?
– How many feel negative about Barack Obama across the US?
– How many people in Arizona feel positive about the new Medicare plan?

Social Monitoring with an Unknown Agency
[Figure: Twitter firehose -> count tweets related to monitored topics: Justin Bieber, Charlie Sheen, Egyptian uprising (Wael Ghonim), Jordan unrest, China unrest (North, Tibet, West, Southeast). Sample counts: 146 in past 5 mins, 3267 in past 12 hours]

Bought by Walmart in May 2011

The Walmart Acquisition
Deal reported to be 250-300M
Kosmix became @WalmartLabs
– based in San Bruno
– local office in India
– plan new offices in China and Brazil
100 persons today, actively hiring

Why?
400B+ in revenue, only 5-10B online vs. 34B for Amazon
Major problems if Walmart doesn’t catch up within 5-10 years
– see Borders
@WalmartLabs can help in many ways
– Provides a core of technical people, attracts more
– Improves traditional e-commerce
  – SEO, SEM, search on walmart.com
  – build a vast product taxonomy
– Helps build the e-commerce of the future
  – social, local, and mobile
  – a good way to catch up and leapfrog Amazon

Improve Traditional E-Commerce
[Figure: product data from thousands of vendors, in-house data, and Web data pass through information extraction, entity disambiguation, entity merging, ... into a product taxonomy (all products > books, cars > US cars > Ford, Chevrolet) that serves search and ads; storage: file system, RDBMS, Hadoop]

Help Build the E-Commerce of Future: Social, Local, and Mobile
O2O (Online 2 Offline) emerging as a major trend
– increasingly tighter integration of online and offline parts
– e.g., Groupon, Living Social
Social, local, and mobile commerce examples
– gift recommendation:
  – “I love salt!”
  – “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”
– personalized “Groupon” with vendors:
  – “You seem to be interested in gourmet coffee. If 50 persons sign up to buy the new DeLonghi coffee maker, you can get that for a 50% discount.”
– stocking a local store
– a Siri-like shopping assistant

Wrapping Up
Social media has become a major frontier on the Web
Integrating social data is fundamentally much harder than integrating “traditional” data
– lack of context
– dynamic environment, new concepts appear quickly
– quality issues, lots of spam
– quick spread of information, user activities
– fast data
– solution will change over time, need human in the loop to monitor
Must integrate “traditional” data well, then bootstrap
– giant taxonomy critical
Crowdsourcing becomes indispensable
– but raises interesting challenges
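
As a closing illustration of the "solution will change over time" point, the event-monitoring idea from earlier in the talk (an initial profile that grows from {Baltimore, shoot} as matching tweets arrive) can be sketched as follows. The talk does not describe the actual learning algorithm; this sketch substitutes a simple frequency rule, and the function names, the `min_overlap` and `min_support` parameters, and the pre-tokenized sample tweets are all assumptions made for illustration.

```python
from collections import Counter


def matches(tweet_terms, profile, min_overlap=2):
    """A tweet matches the event if it shares enough terms with the profile."""
    return len(tweet_terms & profile) >= min_overlap


def update_profile(profile, tweets, min_support=0.6):
    """Add terms that occur in at least min_support of the matching tweets.

    This frequency rule is a stand-in for the unspecified learning algorithm.
    """
    matched = [terms for terms in tweets if matches(terms, profile)]
    if not matched:
        return set(profile)
    counts = Counter(term for terms in matched for term in terms)
    frequent = {t for t, c in counts.items() if c / len(matched) >= min_support}
    return set(profile) | frequent


# Tweets are assumed to be already reduced to sets of normalized terms.
tweets = [
    {"baltimore", "shoot", "tv5"},
    {"baltimore", "shoot", "johns hopkins", "shut"},
    {"baltimore", "shoot", "johns hopkins"},
    {"justin", "bieber"},
]

initial_profile = {"baltimore", "shoot"}
print(sorted(update_profile(initial_profile, tweets)))
```

With these sample tweets the profile picks up "johns hopkins" but not one-off terms such as "tv5". In practice the support threshold would need tuning, and a human in the loop would review the terms being added.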