Twitter, Big Data, and Other Ramblings
Robert Dittmer

Perspective on those V words
- Volume: 1% of the Twitter stream for roughly one month was about 68 million Tweets. Now multiply that by 100. Facebook has the same problem.
- Velocity: How do you analyze thousands of points of data in real-time? SQL Server sure isn’t going to do that.
- Variety: Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic, External
  - Think about what goes into Amazon recommending you a book or movie
- Veracity: It all means nothing if it’s not at least somewhat clean

What do you do with a Tweet?
- Sentiment Analysis is assigning a numerical value to a word
  - Positive, Negative, Neutral connotation
- Methods for performing Sentiment Analysis
  - “Dumb” Method: Break down text into individual words and compare with a sentiment dictionary. AKA “Bag of Words”
  - “Smart” Method: Use a natural language processing tool to analyze parts of speech and calculate sentiment based on context
- Example Tweet
  - “The Apple iPad sucks. The new Google Nexus 7 is awesome!”

Collecting Tweets
- Twitter uses a RESTful service to stream Tweets
- Steps to start streaming your own Tweets
  - Go to dev.twitter.com and create an application
  - Generate your OAuth credentials
  - Find an open-source Twitter library
    - Tweepy (Python)
    - Tweetinvi (C#)
  - Plug your credentials in and modify the example

The Tweet, the Whole Tweet, and Nothing but the Tweet
- JSON Format (Key-Value Pair)
- Notable Fields
  - ID
  - CreatedAt
  - Text
  - Entities
    - Hashtags
    - URLs
  - Latitude, Longitude

What does a Tweet look like?
{"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome =)","geo":{"type":"Point","coordinates":[50.73529254,-4.00720746]},"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"en","entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":330043889589288960,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,"created_at":"Thu May 02 19:39:29 +0000 2013","in_reply_to_user_id":null,"favorite_count":0,"id_str":"330043889589288960","place":{"id":"0613276b16c0d59f","bounding_box":{"type":"Polygon","coordinates":[[[-4.335135,50.429347],[-4.335135,50.874614],[-3.732303,50.874614],[-3.732303,50.429347]]]},"place_type":"city","name":"West Devon","attributes":{},"country_code":"GB","url":"http://api.twitter.com/1/geo/id/0613276b16c0d59f.json","country":"United Kingdom","full_name":"West Devon, Devon"},"user":{"location":"okehampton","default_profile":false,"statuses_count":1345,"profile_background_tile":true,"lang":"en","profile_link_color":"FC0AFC","profile_banner_url":"https://si0.twimg.com/profile_banners/503242961/1354459551","id":503242961,"following":null,"favourites_count":492,"protected":false,"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )","verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF","name":"vicki phillips ","profile_background_color":"FA03DD","created_at":"Sat Feb 25 16:40:53 +0000 2012","default_profile_image":false,"followers_count":149,"profile_image_url_https":"https://si0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://a0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg","follow_request_sent":null,"url":"http://www.facebook.com/vickipolard","utc_offset":0,"time_zone":"Casablanca","notifications":null,"profile_use_background_image":true,"friends_count":1059,"profile_sidebar_fill_color":"DDEEF6","screen_name":"vixoakleophill","id_str":"503242961","profile_image_url":"http://a0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg","listed_count":0,"is_translator":false},"coordinates":{"type":"Point","coordinates":[-4.00720746,50.73529254]}}

My Tweet Collection
- Collected for roughly one month
- Lots of trial and error
  - Originally used Tweepy, but ran into errors
  - Switched to Tweetinvi and it worked
- About 68 million Tweets
  - Apple
  - Amazon
  - Google
  - Microsoft
  - Netflix
  - Tesla
  - Ford (Probably should have used a different car company)

Yahoo! Finance Detour
- Use an HTTP request to get stock data
  - http://finance.yahoo.com/d/quotes.csv?s=AAPL+GOOG+MSFT+YHOO+NFLX+AMZN+TSLA+F&f=snb2b3opl1t1d1
- Create a metric with stock data and compare the sentiment of a company to its performance

Big Data (and regular data) Tools
- Talend Open Studio
- Hadoop
- SAP HANA

Talend Open Studio
- Open Source ETL Tool
- Built on Eclipse

Data Quality and Format Issues
- Even though I saved Tweets in delimited format, issues remained
- Iterated through all 12,736 files with 5000 tweets each
- Verified each row against a schema
- Mapped to different output files
  - Tweet (Fact table)
  - Tracks
  - User Mentions
  - Hashtags
  - URLs

Demo Time!

Hadoop Overview
- Based on the Hadoop Distributed File System and MapReduce
- MapReduce is a way of parallelizing code using batch processing
  - Map finds the data you’re looking for
  - Reduce aggregates that data (count, sum, average)
  - Embarrassingly parallel processing
- Each server in a Hadoop cluster is referred to as a Node
  - NameNode
  - DataNode
- Blocks of data are replicated to three nodes by default
  - Extremely fault tolerant

More Hadoop
- Open-source technology
- Cloudera vs. Hortonworks
  - Intel, IBM, MapR, Amazon EMR
  - Cloudera and Hortonworks are the two biggest faces of Hadoop
  - Intel actively contributes to optimize it for Xeon Processors
  - IBM and MapR are also involved
- Big companies and entities use it

Hadoop Projects
- Hive
  - Data Warehouse on top of Hadoop
  - Uses HiveQL (essentially SQL with a few extras) to query data
  - Abstracts MapReduce processes
  - Has an ODBC connector to allow it to talk to anything that talks to databases
- Pig
  - Uses a language called Pig Latin to analyze data
  - Data flow language abstracts MapReduce for easy use by data analysts
- HBase
  - Billions of rows and millions of columns
  - Distributed column data store

Hadoop Trivia Time
- Who created Hadoop?
- Why is it called Hadoop?
- Who developed the concept of MapReduce?
- What does Facebook Messenger use to store its data?
- Who created Hive?
- What is Accumulo and who created it?

2nd Generation Hadoop
- Much faster than previous versions
  - Hive 0.12 is up to 50X faster than previous versions
  - Hortonworks Stinger project aims for 100X performance improvement
- Projects like Spark are moving towards real-time analysis
  - In-memory cluster compute analysis
  - Streaming processing with routines written in Python and Scala
  - Shark is an implementation of Hive using Spark instead of MapReduce

Hadoop Sentiment Analysis
- Used the “Dumb” method of Sentiment Analysis
- Import the data into HDFS and create Hive tables
  - Tweet
  - Sentiment Dictionary
- Explode words in each tweet to create a view with TweetID and Word
- Join with the Sentiment Dictionary on the word to get sentiment value

Demo Time!

SAP HANA
- In-Memory, Column-Store database
  - Loads all data into main memory
  - Analyze billions of rows with sub-second response time
- Column-store table structure
  - Allows for much better compression and parallelization than row-store
- Used for real-time analytics
- Available as an on-premise appliance or cloud-based VM

Why is SAP HANA Awesome?
- Column-stores are naturally very good at parallelization
- In-Memory means no waiting on IO from disks and is still hundreds of times faster than SSD
- Feature rich
  - Text analytics
  - Predictive Analysis Library
  - Application Server
- It is an actual Database and does everything a database does

Demo Time

SAP HANA Sentiment Analysis
- Sentiment is calculated when creating a full-text index on the text of the tweet
- Creates a sentiment value for each tweet
- Analyze by my different dimensions
  - Aggregate sentiment by hour

Demo Time!

Other Text Analysis Options
- Python Natural Language Toolkit
  - Analyze parts of speech and context
  - Should be possible to integrate with Hadoop (The Google did not help)

Other Big Data Problems
- A GE Engine on a transatlantic flight generates 2TB of sensor data
  - There are four engines on a 747
- What does the LHC at CERN do with the 15 petabytes of data it creates annually?
- How does the NSA store a yottabyte of data?
- How does a small online gaming company analyze its customer base to increase retention and margins?

How is Sentiment Analysis Being Used?
- Companies ingest their social media feeds into these systems
- If a Tweet or Facebook post meets certain criteria, an automated or human response can be requested

Hot vs. Cold Data
- Hot data is the recent data you are most interested in
  - Keep this data in SAP HANA for real-time processing
  - Archive it after a period of time: 1 month, 3 months, 6 months, etc.
- Cold Data is your historical data
  - Data warehouses that can handle massive volumes of data are EXPENSIVE!!!!
  - Use Hadoop and Hive as your data warehouse
    - It only costs the hardware
  - Still able to analyze cold data, store it cheaply, and integrate with SAP HANA
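
Appendix sketch: the “Dumb” bag-of-words method

The “Dumb” method and the Hive steps from the Hadoop Sentiment Analysis slide (explode each tweet into words, join with a sentiment dictionary, aggregate per tweet) can be sketched in a few lines of Python. This is only an illustration, not the code used in the demo: the four-word dictionary, the function names `explode` and `score_tweets`, and the tweet IDs are all made up for the example, and a real run would load a published sentiment lexicon instead.

```python
import re
from collections import defaultdict

# Hypothetical sentiment dictionary (word -> score). These entries are
# invented for illustration; a real run would load a full lexicon.
SENTIMENT = {"awesome": 1, "great": 1, "sucks": -1, "terrible": -1}


def explode(tweets):
    """Yield (tweet_id, word) pairs, one per lowercase word in each tweet."""
    for tweet_id, text in tweets.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            yield tweet_id, word


def score_tweets(tweets, dictionary=SENTIMENT):
    """Join exploded words with the dictionary and sum the scores per tweet."""
    totals = defaultdict(int)
    for tweet_id, word in explode(tweets):
        # Words missing from the dictionary count as neutral (0).
        totals[tweet_id] += dictionary.get(word, 0)
    # Include tweets that produced no words at all, with a score of 0.
    return {tweet_id: totals[tweet_id] for tweet_id in tweets}


tweets = {
    1: "Iron man 3 was awesome =)",
    2: "The Apple iPad sucks. The new Google Nexus 7 is awesome!",
}
scores = score_tweets(tweets)
print(scores)
```

With this toy dictionary the first tweet scores 1 and the example tweet about the iPad and the Nexus 7 scores 0, because “sucks” and “awesome” cancel out. That is the weakness of the bag-of-words approach: it cannot tell which company each word refers to, which is what the “Smart” method is meant to address.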