Social Media, Data Integration, and Human Computation
AnHai Doan
University of Wisconsin
@WalmartLabs

Background
Professor at University of Wisconsin-Madison
In 2010 took unpaid leave and joined Kosmix
– Bay-area startup, did semantic analysis of social media
Acquired by Walmart in 2011, became WalmartLabs
– Based in San Bruno, local office in India, hundreds of people
Why did Walmart buy a social-media startup?
– Wanted to catch up with Amazon (<10B online vs. >35B for Amazon)
– Major problems if it doesn’t get close in 10 years (see Borders)
– Kosmix/WalmartLabs helps in many ways
  – Provides a core of technical people, helps attract more
  – Improves traditional e-commerce
  – Builds the e-commerce of the future: Social + Local + Mobile

Major R&D Groups at WalmartLabs
Search and Products
– Polaris
– Giant product catalog
– Product intelligence
Demand Generation
– SEO, SEM
– Customer targeting and personalization
Social, Mobile, and Local E-Commerce
– Mining social data
– Stores + Mobile
– Build social/mobile apps (Get on the Shelf, gift recommendation, etc.)
Big Fast Data
Large-scale Machine Learning
Data Extraction & Integration
Crowdsourcing
Social Genome
Special Initiatives

Social Genome
Mine everything we can out of social data
– From tweets, FB feeds, Foursquare, blogs, etc.
– Mine users, organizations, products, sentiments, events, etc.
Connect them to those in the traditional Web world
Put them into a giant knowledge base
– Big, evolves rapidly over time
– Call this the “social genome”
Use the social genome to power multiple e-commerce applications
– Search
– Product intelligence
– Gift recommendation
– Personalized “Groupon”
– Etc.

Social Genome
[Figure: a fragment of the social genome. A taxonomy rooted at “all” branches into places, people, and events. People include Twitter users (@melgibson, @dsmith, …), FB users (mel-gibson, davesmith, …), and actors (Angelina Jolie, Mel Gibson). Events include sports, celebrities (Gibson car crash), politics (Egyptian uprising), … Places include Egypt, Cairo, and Tahrir. Edge labels include the-same-as, tweet-about, capital-of, located-in, and related-to. Sample tweets: “@dsmith: Mel crashed. Maserati is gone.” and “@far213: Tahrir is packed!”]
Building Social Genome: Three Sample Challenges
[Figure: the social genome graph (taxonomy of places, people, and events, with Twitter users, FB users, actors, the Gibson car crash, the Egyptian uprising, Egypt, Cairo, and Tahrir, plus the tweets “@dsmith: Mel crashed. Maserati is gone.” and “@far213: Tahrir is packed!”), with three challenge areas marked 1, 2, and 3.]

Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media
[Figure: taxonomy fragment. “all” branches into places, events, and people; people include actors (Angelina Jolie, Mel Gibson), sports, and professors (Mel Brooks); events include celebrities (Gibson car crash), politics (Egyptian uprising), …]
Traditional text: “Mel was arrested again. What a dramatic fall since his Oscar-winning day.”
– Extraction: use rule-based / NLP / machine learning techniques
Social media: “@dsmith: mel crashed. maserati is gone.”
– Extraction: use dictionaries
Disambiguation draws on two kinds of context
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Short-term, social context: crash, car, Maserati

Must Maintain a Highly Dynamic Social Genome
[Figure: the same taxonomy fragment, with contexts attached to its nodes.]
– Short-term, social context: crash, car, Maserati
– Long-term, Web context: actor, movie, Oscar, Hollywood
– Latency less than 2 seconds, maintained using a fast-data processing system

The Giant Traditional Taxonomy is the Secret Weapon
[Figure: taxonomy fragment. “all” branches into places (Egypt, Cairo, Tahrir, linked by capital-of and located-in) and people (actors: Angelina Jolie, Mel Gibson).]
Without it, dictionary-based extraction is not possible
Provides a framework to
– “understand” social media, find related concepts, “hang” social contexts
Very hard to develop, takes years
– Integrate data from multiple sources, like learning a foreign language
Partly explains why it was hard for others to catch up
To integrate social media, must integrate traditional data well, then bootstrap

Context is also Absolutely Critical
– Alice lives in NYC and tweets “Go Giants!”
  Entity extraction: SF Giants or NY Giants? Context / disambiguation:
NY Giants
– Bob likes Buster Posey (an SF Giants player) and tweets “Go Giants!”
  Entity extraction: SF Giants or NY Giants? Context / disambiguation: SF Giants
– Charlie tweets “Go Giants!” on Feb 4th, the day before the Super Bowl (an event); the Web is talking about the NY Giants
  Entity extraction: SF Giants or NY Giants? Context / disambiguation: NY Giants

Event Detection: Current Solutions
[Figure: streams from Twitter, Foursquare, Facebook, Myspace, Flickr, … feed event detection, which outputs events by category: sports, celebrities (Gibson car crash), politics (Egyptian uprising), …]
• Lot of current work in academia / industry
• Limitations of most of the current solutions
  – exploit just one kind of heuristic
    • e.g., find hot, trending, popular words (Egypt, revolt)
  – do not exploit crowdsourcing
  – do not scale

Event Detection: Our Solution
[Figure: Twitter, Foursquare, … feed Detector 1, Detector 2, …, Detector n. Each detector emits candidate events to an event evaluator and ranker, which outputs ranked events. Crowdsourcing Populations 1, 2, 3, ... are also part of the pipeline. The pipeline runs on Muppet, a platform to process fast data over multiple machines.]
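
The multi-detector design above can be illustrated with a minimal sketch. It is not the WalmartLabs implementation: the two heuristics, the `Candidate` type, the function names, and the scoring rule (sum of detector scores, weighted by a crowd vote) are all assumptions chosen only to show how several detectors can feed one evaluator and ranker.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str      # short name of the candidate event
    score: float    # detector-specific confidence in [0, 1]
    detector: str   # which (hypothetical) detector proposed it


def trending_word_detector(posts, min_count=2):
    """Hypothetical heuristic 1: words that occur in several posts."""
    counts = Counter()
    for post in posts:
        # Count each word once per post so one long post cannot dominate.
        counts.update({w.strip(".,!?").lower() for w in post.split()})
    counts.pop("", None)
    return [Candidate(w, c / len(posts), "trending")
            for w, c in counts.items() if c >= min_count and len(w) > 3]


def place_mention_detector(posts, places=("tahrir", "cairo")):
    """Hypothetical heuristic 2: posts that mention a known place."""
    hits = Counter()
    for post in posts:
        text = post.lower()
        for place in places:
            if place in text:
                hits[place] += 1
    return [Candidate(p, c / len(posts), "place") for p, c in hits.items()]


def rank_events(candidates, crowd_votes=None):
    """Merge candidates from all detectors and rank them.

    crowd_votes maps a label to the fraction of crowd workers (0..1) who
    judged it a real event; labels nobody judged get a neutral 0.5.
    """
    crowd_votes = crowd_votes or {}
    merged = Counter()
    for cand in candidates:
        merged[cand.label] += cand.score  # agreement across detectors adds up
    ranked = [(label, score * crowd_votes.get(label, 0.5))
              for label, score in merged.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


posts = ["Tahrir is packed!", "Heading to Tahrir now",
         "Mel crashed. Maserati is gone."]
candidates = trending_word_detector(posts) + place_mention_detector(posts)
ranked = rank_events(candidates, crowd_votes={"tahrir": 0.9})
```

In this toy run both detectors propose “tahrir”, so it is the only ranked event. A real system would run the detectors over live streams rather than a fixed list, and would need a more careful way to combine detector scores with crowd judgments.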
Processing Fast Data
Big data management is well known by now
– use MapReduce implementations
– simple programming model, widespread adoption
But a lot of fast data is also emerging
– 150 M tweets / day, 1 billion FB shares / day, 3 M Foursquare checkins / day
– come into the system as very fast streams
Numerous applications over these streams
Need to process in real time
– to answer “what is happening now?”

Processing Fast Data
What we want: a platform that
– delivers real-time processing (over multiple machines)
– is highly scalable (as the data gets faster and faster)
– has a simple programming model
  – so developers can quickly write hundreds of apps
  – ideally like map-reduce, which developers already know
– has real-time query and storage capability
  – apps can query content in real time
  – distributed across multiple machines
Answer: Muppet, like Map-Reduce, but for fast data
– see “MapReduce-Style Processing of Fast Data”, VLDB-12

Using the Social Genome
Gift recommendation:
– “I love salt!”
– “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”

Using the Social Genome
Search query expansion
– “Advil” → “advil headache cramp”
Personalized “Groupon” with vendors
– “You seem to be interested in gourmet coffee.
If 50 people sign up to buy the new DeLonghi coffee maker, you can get it for a 50% discount.”
Stocking a local store
– Lots of people in Mountain View are interested in outdoor sports
– Stock up the local Walmart store with related products
A Siri-like shopping assistant

Wrapping Up
The future of e-commerce: social, mobile, and local
Retailers must increasingly be data / Web players
Social media is important for e-commerce
Integrating social data is fundamentally much harder than integrating “traditional” data
– lack of context
– dynamic environment, new concepts appear quickly
– quality issues, lots of spam
– fast data
Must integrate “traditional” data well, then bootstrap
– giant taxonomy critical
Crowdsourcing becomes indispensable
– but raises interesting challenges