Twitter Analysis

Twitter, Big Data, and Other Ramblings
Robert Dittmer
Perspective on those V words
Volume- 1% of the Twitter stream for roughly one month was about 68 million
Tweets. Now multiply that by 100. Facebook has the same problem.
Velocity- How do you analyze thousands of points of data in real-time? SQL
Server sure isn’t going to do that.
Variety- Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic.
Think about what goes into Amazon recommending you a book or movie.
Veracity- It all means nothing if it’s not at least somewhat clean
What do you do with a Tweet?
Sentiment Analysis is assigning a numerical value to a word
Positive, Negative, Neutral connotation
Methods for performing Sentiment Analysis
“Dumb” Method- Break down text into individual words and compare with a
sentiment dictionary. AKA “Bag of Words”
“Smart” Method- Use a natural language processing tool to analyze parts of speech
and calculate sentiment based on context
Example Tweet
“The Apple iPad sucks. The new Google Nexus 7 is awesome!”
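A minimal sketch of the "Dumb" method applied to the example Tweet. The two-word dictionary and its scores are made up for illustration; a real sentiment dictionary would have thousands of entries.

```python
import re

# Hypothetical mini sentiment dictionary: word -> score (assumed values).
SENTIMENT = {"awesome": 1, "sucks": -1}

def bag_of_words_score(text):
    """Sum dictionary scores over the words, ignoring order and context."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return sum(SENTIMENT.get(word, 0) for word in words)

def per_sentence_scores(text):
    """Score each sentence separately so opposing opinions do not cancel."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [bag_of_words_score(s) for s in sentences]

tweet = "The Apple iPad sucks. The new Google Nexus 7 is awesome!"
print(bag_of_words_score(tweet))   # 0: one negative and one positive word cancel
print(per_sentence_scores(tweet))  # [-1, 1]
```

The whole-Tweet score of 0 hides two strong opinions about two different products, which is the weakness a context-aware "Smart" method is meant to address.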
Collecting Tweets
Twitter uses a RESTful service to stream Tweets
Steps to start streaming your own Tweets
Go to the Twitter developer site and create an application
Generate your OAuth credentials
Find an open-source Twitter library
Tweepy (Python)
Tweetinvi (C#)
Plug your credentials in and modify the example
The Tweet, the Whole Tweet, and
Nothing but the Tweet
JSON Format (Key-Value Pair)
Notable Fields
Latitude, Longitude
What does a Tweet look like?
{"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome …
… href=\"\" rel=\"nofollow\">Twitter for …
… May 02 19:39:29 +0000 …
… Kingdom","full_name":"West Devon, …
… 961,"following":null,"favourites_count":492,"protected":false,"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )","verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF","name":"vicki phillips ","profile_background_color":"FA03DD","created_at":"Sat Feb 25 16:40:53 +0000 …
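Pulling the notable fields out of a Tweet might look roughly like this. The sample Tweet below is made up and heavily trimmed, and it assumes the GeoJSON-style "coordinates" field, which lists longitude before latitude.

```python
import json

# Made-up, trimmed Tweet for illustration only (not taken from the collection).
raw = """{"text": "Example tweet text", "filter_level": "medium",
 "created_at": "Wed May 01 12:00:00 +0000 2013",
 "coordinates": {"type": "Point", "coordinates": [-4.14, 50.64]},
 "user": {"name": "example user", "verified": false, "favourites_count": 492}}"""

def extract_fields(tweet):
    """Return a flat dict of the fields of interest; location may be missing."""
    coords = tweet.get("coordinates") or {}
    longitude, latitude = coords.get("coordinates", (None, None))
    return {
        "text": tweet["text"],
        "created_at": tweet["created_at"],
        "user_name": tweet["user"]["name"],
        "latitude": latitude,
        "longitude": longitude,
    }

fields = extract_fields(json.loads(raw))
print(fields["latitude"], fields["longitude"])  # 50.64 -4.14
```

Most Tweets carry no location at all, so the sketch returns None for both values when "coordinates" is null or absent.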
My Tweet Collection
Collected for roughly one month
Lots of trial and error
Originally used Tweepy, but ran into errors
Switched to Tweetinvi and it worked
About 68 million Tweets
Ford (Probably should have used a different car company)
Yahoo! Finance Detour
Use an HTTP request to get stock data
Create a metric with stock data and compare the sentiment of a company to
their performance
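One possible metric, sketched under assumptions: turn daily closing prices into daily returns and correlate them with the average daily sentiment. The prices and sentiment values below are invented, and Pearson correlation is only one of many ways to compare the two series.

```python
from math import sqrt

def daily_returns(closes):
    """Fractional change between consecutive closing prices."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented sample data: five closing prices and the mean sentiment
# for the four days that have a return.
closes = [100.0, 102.0, 101.0, 104.0, 103.0]
sentiment = [0.4, -0.1, 0.5, -0.2]

r = pearson(daily_returns(closes), sentiment)
print(round(r, 3))  # value between -1 and 1
```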
Big Data (and regular data) Tools
Talend Open Studio
Open Source ETL Tool
Built on Eclipse
Data Quality and Format Issues
Even though I saved Tweets in delimited format, issues remained
Iterated through all 12,736 files with 5,000 Tweets each
Verified each row against a schema
Mapped to different output files
Tweet (Fact table)
User Mentions
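The actual job was built in Talend, but the idea of checking each row against a schema and mapping it to different outputs can be sketched in plain Python. The three-column schema, the pipe delimiter, and the sample rows are all assumptions for illustration.

```python
import csv
import re

# Hypothetical schema: column name -> converter that raises ValueError on bad input.
SCHEMA = [("tweet_id", int), ("created_at", str), ("text", str)]

def validate_row(row):
    """Return a typed dict if the row matches the schema, else None."""
    if len(row) != len(SCHEMA):
        return None
    try:
        return {name: convert(value) for (name, convert), value in zip(SCHEMA, row)}
    except ValueError:
        return None

def split_rows(lines, delimiter="|"):
    """Route rows to a tweet (fact) output, a user-mentions output, or rejects."""
    tweets, mentions, rejects = [], [], []
    for row in csv.reader(lines, delimiter=delimiter):
        parsed = validate_row(row)
        if parsed is None:
            rejects.append(row)
            continue
        tweets.append(parsed)
        for name in re.findall(r"@(\w+)", parsed["text"]):
            mentions.append({"tweet_id": parsed["tweet_id"], "mentioned_user": name})
    return tweets, mentions, rejects

sample = [
    "1|Wed May 01 12:00:00 +0000 2013|thanks @alice and @bob",
    "oops|bad id|text",       # tweet_id is not an integer
    "2|missing a field",      # wrong number of columns
]
tweets, mentions, rejects = split_rows(sample)
print(len(tweets), len(mentions), len(rejects))  # 1 2 2
```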
Demo Time!
Hadoop Overview
Based on the Hadoop Distributed File System and MapReduce
MapReduce is a way of parallelizing code using batch processing
Map finds the data you’re looking for
Reduce aggregates that data (count, sum, average)
Embarrassingly parallel processing
Each server in a Hadoop cluster is referred to as a Node
Blocks of data are replicated to three nodes
Extremely fault tolerant
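The Map and Reduce steps can be imitated on a single machine with a word count. This is only a toy model of the idea: real Hadoop jobs run the map and reduce functions on many nodes and handle the shuffle between them.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, a count)."""
    return key, sum(values)

records = ["big data big cluster", "big data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Each call to map_phase depends only on its own record, which is what makes the work embarrassingly parallel.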
More Hadoop
Open-source technology
Cloudera vs. Hortonworks
Intel, IBM, MapR, Amazon EMR
Cloudera and Hortonworks are the two biggest faces of Hadoop
Intel actively contributes to optimize it for Xeon Processors
IBM and MapR also involved
Big companies and entities use it
Hadoop Projects
Hive
Data warehouse on top of Hadoop
Uses HiveQL (essentially SQL with a few extras) to query data
Abstracts MapReduce processes
Has an ODBC connector that lets it talk to anything that talks to databases
Pig
Uses a language called Pig Latin to analyze data
Data flow language that abstracts MapReduce for easy use by data analysts
HBase
Billions of rows and millions of columns
Distributed column data store
Hadoop Trivia Time
Who created Hadoop?
Why is it called Hadoop?
Who developed the concept of MapReduce?
What does Facebook Messenger use to store its data?
Who created Hive?
What is Accumulo and who created it?
2nd Generation Hadoop
Much faster than previous versions
Hive 0.12 is up to 50X faster than previous versions
Hortonworks Stinger project aims for 100X performance improvement
Projects like Spark are moving towards real-time analysis
In-memory cluster compute analysis
Streaming processing with routines written in Python and Scala
Shark is an implementation of Hive using Spark instead of MapReduce
Hadoop Sentiment Analysis
Used the “Dumb” method of Sentiment Analysis
Import the data into HDFS and create Hive tables
Sentiment Dictionary
Explode words in each tweet to create a view with TweetID and Word
Join with the Sentiment Dictionary on the word to get sentiment value
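The explode-and-join steps above can be sketched with SQLite standing in for Hive. The table names, sample Tweets, and dictionary entries are made up, and the SQL is plain SQLite rather than HiveQL, so treat it as an outline of the logic only.

```python
import sqlite3

# Invented sample data.
tweets = [(1, "the ipad sucks"), (2, "the nexus 7 is awesome"), (3, "just a tweet")]
dictionary = [("awesome", 1), ("sucks", -1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet_words (tweet_id INTEGER, word TEXT)")
conn.execute("CREATE TABLE sentiment_dict (word TEXT, score INTEGER)")

# "Explode": one (tweet_id, word) row per word in each tweet.
conn.executemany(
    "INSERT INTO tweet_words VALUES (?, ?)",
    [(tweet_id, word) for tweet_id, text in tweets for word in text.lower().split()],
)
conn.executemany("INSERT INTO sentiment_dict VALUES (?, ?)", dictionary)

# Join on the word and sum the scores per tweet; unmatched words count as 0.
rows = conn.execute(
    """
    SELECT tw.tweet_id, COALESCE(SUM(d.score), 0) AS sentiment
    FROM tweet_words tw
    LEFT JOIN sentiment_dict d ON tw.word = d.word
    GROUP BY tw.tweet_id
    ORDER BY tw.tweet_id
    """
).fetchall()
print(rows)  # [(1, -1), (2, 1), (3, 0)]
```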
Demo Time!
SAP HANA
In-Memory, Column-Store Database
Loads all data into main-memory
Analyze billions of rows with sub-second response time
Column-store table structure
Allows for much better compression and parallelization than row-store
Used for real-time analytics
Available as an on-premises appliance or a cloud-based VM
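A toy illustration of why a column layout compresses well. This is a general sketch of the column-store idea with run-length encoding, using invented rows; it makes no claim about how SAP HANA stores data internally.

```python
from itertools import groupby

# Invented row-store data: (day, country, retweets).
rows = [
    ("2013-05-01", "US", 3),
    ("2013-05-01", "US", 5),
    ("2013-05-01", "UK", 2),
    ("2013-05-02", "UK", 7),
]

# Pivot the rows into one list per column.
names = ("day", "country", "retweets")
columns = {name: list(values) for name, values in zip(names, zip(*rows))}

def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

print(run_length_encode(columns["day"]))  # [('2013-05-01', 3), ('2013-05-02', 1)]
print(sum(columns["retweets"]))           # 17, computed by scanning one column only
```

Values in a single column tend to repeat, so they encode compactly, and an aggregate only has to read the columns it needs.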
Why is SAP HANA Awesome?
Column-stores are naturally very good at parallelization
In-memory means no waiting on disk I/O, and main memory is still hundreds of
times faster than SSD
Feature rich
Text analytics
Predictive Analysis Library (PAL)
Application Server
It is an actual database and does everything a database does
Demo Time
SAP HANA Sentiment Analysis
Sentiment is calculated when creating a full-text index on the text of the Tweet
Creates a sentiment value for each tweet
Analyze by many different dimensions
Aggregate sentiment by hour
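Aggregating by hour is done inside SAP HANA in the demo, but the logic can be sketched in Python. The scored Tweets below are invented, and the timestamp format is assumed to match the "created_at" strings in the Tweet JSON.

```python
from collections import defaultdict
from datetime import datetime

# Assumed layout of Twitter "created_at" strings.
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"

# Invented (created_at, sentiment score) pairs.
scored = [
    ("Wed May 01 19:05:00 +0000 2013", 1),
    ("Wed May 01 19:40:00 +0000 2013", -1),
    ("Wed May 01 20:10:00 +0000 2013", 1),
]

def sentiment_by_hour(scored_tweets):
    """Average the sentiment scores of all tweets that fall in the same hour."""
    buckets = defaultdict(list)
    for created_at, score in scored_tweets:
        timestamp = datetime.strptime(created_at, TWITTER_TIME)
        buckets[timestamp.replace(minute=0, second=0)].append(score)
    return {hour: sum(scores) / len(scores) for hour, scores in sorted(buckets.items())}

hourly = sentiment_by_hour(scored)
for hour, average in hourly.items():
    print(hour.isoformat(), average)
```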
Demo Time!
Other Text Analysis Options
Python Natural Language Toolkit
Analyze parts of speech and context
Should be possible to integrate with Hadoop (The Google did not help)
Other Big Data Problems
A GE engine on a transatlantic flight generates 2TB of sensor data
There are four engines on a 747
What does the LHC at CERN do with the 15 petabytes of data it creates?
How does the NSA store a yottabyte of data?
How does a small online gaming company analyze their customer base to
increase retention and margins?
How is Sentiment Analysis Being Used?
Companies ingest their social media feeds into these systems
If a Tweet or Facebook post meets certain criteria, an automated or human
response can be requested
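A rule like that could be as simple as the sketch below. The thresholds, field names, and the three outcomes are hypothetical; a real system would define its own criteria.

```python
def route(post, negative_threshold=-2, influencer_followers=10000):
    """Decide what kind of response a scored post should trigger.

    All thresholds and field names here are assumptions for illustration.
    """
    is_negative = post["sentiment"] <= negative_threshold
    if is_negative and post["followers"] >= influencer_followers:
        return "human"        # unhappy and widely followed: escalate to a person
    if is_negative:
        return "automated"    # unhappy: send a canned reply
    return "none"             # nothing to do

posts = [
    {"sentiment": -3, "followers": 50000},
    {"sentiment": -2, "followers": 120},
    {"sentiment": 1, "followers": 900000},
]
print([route(post) for post in posts])  # ['human', 'automated', 'none']
```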
Hot vs. Cold Data
Hot data is the recent data you are most interested in
Keep this data in SAP HANA for real-time processing
Archive it after a period of time: 1 month, 3 months, 6 months, etc…
Cold Data is your historical data
Data warehouses that can handle massive volumes of data are EXPENSIVE!!!!
Use Hadoop and Hive as your data warehouse
The only cost is the hardware
Still able to analyze cold data, store it cheaply, and integrate with SAP HANA
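The hot/cold split comes down to an age cutoff. A minimal sketch, assuming a 90-day hot window and records that carry a timezone-aware "created_at" value; where each list is then loaded (SAP HANA or Hadoop) is outside the scope of the example.

```python
from datetime import datetime, timedelta, timezone

def split_hot_cold(records, now, hot_window=timedelta(days=90)):
    """Partition records into recent (hot) and historical (cold) by age."""
    cutoff = now - hot_window
    hot = [record for record in records if record["created_at"] >= cutoff]
    cold = [record for record in records if record["created_at"] < cutoff]
    return hot, cold

# Invented records and a fixed "now" so the result is reproducible.
now = datetime(2013, 11, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "created_at": datetime(2013, 10, 15, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2013, 5, 2, tzinfo=timezone.utc)},
]
hot, cold = split_hot_cold(records, now)
print([r["id"] for r in hot], [r["id"] for r in cold])  # [1] [2]
```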