Twitter Analysis

Twitter, Big Data, and
Other Ramblings
Robert Dittmer
Perspective on those V words

Volume- 1% of the Twitter stream for roughly one month was about 68 million
Tweets. Now multiply that by 100. Facebook has the same problem.

Velocity- How do you analyze thousands of points of data in real-time? SQL
Server sure isn’t going to do that.

Variety- Social Media, Manufacturing, Sales, Financial, CRM, Web Traffic,
External


Think about what goes into Amazon recommending you a book or movie
Veracity- It all means nothing if it’s not at least somewhat clean
What do you do with a Tweet?

Sentiment Analysis assigns a numerical value to a word or a piece of text



Positive, Negative, Neutral connotation
Methods for performing Sentiment Analysis

“Dumb” Method- Break down text into individual words and compare with a
sentiment dictionary. AKA “Bag of Words”

“Smart” Method- Use a natural language processing tool to analyze parts of speech
and calculate sentiment based on context
Example Tweet

“The Apple iPad sucks. The new Google Nexus 7 is awesome!”
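
As a rough illustration of the "Dumb" method applied to the example Tweet, here is a minimal sketch. The two-word dictionary and the `score` function are made up for this example; a real sentiment dictionary holds thousands of scored words.

```python
import re

# Hypothetical miniature sentiment dictionary (assumption, not a real word list).
SENTIMENT = {"sucks": -1, "awesome": 1}

def score(text, dictionary=SENTIMENT):
    """Sum the dictionary value of every word; unknown words count as 0."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return sum(dictionary.get(word, 0) for word in words)

tweet = "The Apple iPad sucks. The new Google Nexus 7 is awesome!"
total = score(tweet)  # "sucks" and "awesome" cancel each other out
```

The Tweet nets out as neutral even though it is negative about one product and positive about another. Attributing each opinion to its product is the kind of context the "Smart" method is meant to capture.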
Collecting Tweets

Twitter's Streaming API delivers Tweets over a long-lived HTTP connection (its REST API is request/response)

Steps to start streaming your own Tweets

Go to dev.twitter.com and create an application

Generate your OAuth credentials

Find an open-source Twitter library


Tweepy (Python)

Tweetinvi (C#)
Plug your credentials in and modify the example
The Tweet, the Whole Tweet, and
Nothing but the Tweet

JSON Format (Key-Value Pair)

Notable Fields

ID

CreatedAt

Text

Entities


Hashtags

URLs
Latitude, Longitude
What does a Tweet look like?

{"filter_level":"medium","contributors":null,"text":"Iron man 3 was awesome =)",
"geo":{"type":"Point","coordinates":[50.73529254,-4.00720746]},
"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"en",
"entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},
"in_reply_to_status_id_str":null,"id":330043889589288960,
"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>",
"in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,
"created_at":"Thu May 02 19:39:29 +0000 2013",
"in_reply_to_user_id":null,"favorite_count":0,"id_str":"330043889589288960",
"place":{"id":"0613276b16c0d59f",
"bounding_box":{"type":"Polygon","coordinates":[[[-4.335135,50.429347],[-4.335135,50.874614],[-3.732303,50.874614],[-3.732303,50.429347]]]},
"place_type":"city","name":"West Devon","attributes":{},"country_code":"GB",
"url":"http://api.twitter.com/1/geo/id/0613276b16c0d59f.json",
"country":"United Kingdom","full_name":"West Devon, Devon"},
"user":{"location":"okehampton","default_profile":false,"statuses_count":1345,
"profile_background_tile":true,"lang":"en","profile_link_color":"FC0AFC",
"profile_banner_url":"https://si0.twimg.com/profile_banners/503242961/1354459551",
"id":503242961,"following":null,"favourites_count":492,"protected":false,
"profile_text_color":"0084B4","description":"vicky pollards twin sister ( the nice one )",
"verified":false,"contributors_enabled":false,"profile_sidebar_border_color":"FFFFFF",
"name":"vicki phillips ","profile_background_color":"FA03DD",
"created_at":"Sat Feb 25 16:40:53 +0000 2012","default_profile_image":false,"followers_count":149,
"profile_image_url_https":"https://si0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg",
"geo_enabled":true,
"profile_background_image_url":"http://a0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg",
"profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/636836788/xb893f8eb29554020cb59540210c070b.jpg",
"follow_request_sent":null,"url":"http://www.facebook.com/vickipolard","utc_offset":0,
"time_zone":"Casablanca","notifications":null,"profile_use_background_image":true,
"friends_count":1059,"profile_sidebar_fill_color":"DDEEF6","screen_name":"vixoakleophill",
"id_str":"503242961",
"profile_image_url":"http://a0.twimg.com/profile_images/3083034337/fb9a8158c125dbb5a0650f58206880e0_normal.jpeg",
"listed_count":0,"is_translator":false},
"coordinates":{"type":"Point","coordinates":[-4.00720746,50.73529254]}}
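
To show how the notable fields could be pulled out of a Tweet like this one, here is a small sketch using only the Python standard library. The `raw` string is an abbreviated copy of the example Tweet, and `notable_fields` is a hypothetical helper name. The key names inside each hashtag and URL entity (`text`, `url`) are assumed from the Twitter v1.1 format, since the example Tweet has no entities to confirm them.

```python
import json
from datetime import datetime

# Abbreviated copy of the example Tweet: only the fields read below are kept.
raw = '''{"id": 330043889589288960,
 "created_at": "Thu May 02 19:39:29 +0000 2013",
 "text": "Iron man 3 was awesome =)",
 "entities": {"symbols": [], "urls": [], "hashtags": [], "user_mentions": []},
 "coordinates": {"type": "Point", "coordinates": [-4.00720746, 50.73529254]}}'''

def notable_fields(tweet):
    """Flatten the fields of interest into one dictionary."""
    # "coordinates" is null for Tweets without a location; it lists longitude first.
    point = tweet.get("coordinates") or {}
    lon, lat = point.get("coordinates", (None, None))
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id"],
        "created_at": datetime.strptime(tweet["created_at"],
                                        "%a %b %d %H:%M:%S %z %Y"),
        "text": tweet["text"],
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "urls": [link["url"] for link in entities.get("urls", [])],
        "latitude": lat,
        "longitude": lon,
    }

fields = notable_fields(json.loads(raw))
```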
My Tweet Collection


Collected for roughly one month

Lots of trial and error

Originally used Tweepy, but ran into errors

Switched to Tweetinvi and it worked
About 68 million Tweets

Apple

Amazon

Google

Microsoft

Netflix

Tesla

Ford (Probably should have used a different car company)
Yahoo! Finance Detour

Use an HTTP request to get stock data

http://finance.yahoo.com/d/quotes.csv?s=AAPL+GOOG+MSFT+YHOO+NFLX+AMZN+TSLA+F&f=snb2b3opl1t1d1

Create a metric with stock data and compare the sentiment of a company to
its performance
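
The slides do not say which metric was used. One possible sketch, purely as an assumption, is to correlate a company's daily price change with its average daily sentiment. The function names and every number below are made up for illustration.

```python
from math import sqrt

def daily_returns(closes):
    """Fractional change from each closing price to the next."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up numbers: four closing prices give three daily returns,
# paired with one average sentiment value per return day.
closes = [100.0, 102.0, 101.0, 104.0]
sentiment = [0.4, -0.1, 0.6]
r = pearson(daily_returns(closes), sentiment)
```

A value of `r` near 1 would suggest sentiment and price moved together over the period, near -1 that they moved in opposite directions. A month of data is a small sample either way.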
Big Data (and regular data) Tools

Talend Open Studio

Hadoop

SAP HANA
Talend Open Studio

Open Source ETL Tool


Built on Eclipse
Data Quality and Format Issues

Even though I saved Tweets in delimited format, issues remained

Iterated through all 12,736 files with 5,000 tweets each

Verified each row against a schema

Mapped to different output files


Tweet (Fact table)

Tracks

User Mentions

Hashtags

URLs
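
The work was done in Talend Open Studio, but the idea of checking each delimited row against a schema can be sketched in plain Python. The three-column schema, the `|` delimiter, and the function names are assumptions made for this example. Routing valid rows to the separate output files is left out.

```python
import csv
import io

# Hypothetical schema: (column name, converter) in file order.
SCHEMA = [("tweet_id", int), ("created_at", str), ("text", str)]

def validate(row):
    """Return a dict of converted values, or None if the row breaks the schema."""
    if len(row) != len(SCHEMA):
        return None
    record = {}
    for (name, convert), value in zip(SCHEMA, row):
        try:
            record[name] = convert(value)
        except ValueError:
            return None
    return record

def split_rows(lines, delimiter="|"):
    """Separate rows that match the schema from rows that do not."""
    good, rejected = [], []
    for row in csv.reader(lines, delimiter=delimiter):
        record = validate(row)
        if record is None:
            rejected.append(row)
        else:
            good.append(record)
    return good, rejected

sample = io.StringIO(
    "1|Thu May 02 19:39:29 +0000 2013|Iron man 3 was awesome =)\n"
    "not-a-number|bad row\n"
)
good, rejected = split_rows(sample)
```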
Demo Time!
Hadoop Overview

Based on the Hadoop Distributed File System and MapReduce

MapReduce is a way of parallelizing code using batch processing

Map finds the data you’re looking for

Reduce aggregates that data (count, sum, average)

Embarrassingly parallel processing

Each server in a Hadoop cluster is referred to as a Node


NameNode

DataNode
Blocks of data are replicated to three nodes by default

Extremely fault tolerant
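
The Map and Reduce steps can be imitated in a few lines of single-process Python. This is only a sketch of the programming model, not of Hadoop itself: there is no cluster, no HDFS, and the function names are made up.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: aggregate all values for one key (here, a count)."""
    return key, sum(values)

def run(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

counts = run(["apple google", "google tesla google"])
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many nodes, which is what makes the workload embarrassingly parallel.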
More Hadoop

Open-source technology

Cloudera vs. Hortonworks

Intel, IBM, MapR, Amazon EMR

Cloudera and Hortonworks are the two biggest faces of Hadoop

Intel actively contributes to optimize it for Xeon Processors

IBM and MapR also involved

Big companies and entities use it
Hadoop Projects

Hive

Data Warehouse on top of Hadoop

Uses HiveQL (essentially SQL with a few extras) to query data




Abstracts MapReduce processes
Has an ODBC connector to allow it to talk to anything that talks to databases
Pig

Uses a language called Pig Latin to analyze data

Data flow language abstracts MapReduce for easy use for data analysts
HBase

Billions of rows and millions of columns

Distributed column data store
Hadoop Trivia Time

Who created Hadoop?

Why is it called Hadoop?

Who developed the concept of MapReduce?

What does Facebook Messenger use to store its data?

Who created Hive?

What is Accumulo and who created it?
2nd Generation Hadoop


Much faster than previous versions

Hive 0.12 is up to 50X faster than previous versions

Hortonworks Stinger project aims for 100X performance improvement
Projects like Spark are moving towards real-time analysis

In-memory cluster compute analysis

Stream processing with routines written in Python and Scala

Shark is an implementation of Hive using Spark instead of MapReduce
Hadoop Sentiment Analysis

Used the “Dumb” method of Sentiment Analysis

Import the data into HDFS and create Hive tables

Tweet

Sentiment Dictionary

Explode words in each tweet to create a view with TweetID and Word

Join with the Sentiment Dictionary on the word to get sentiment value

Demo Time!
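
The explode-and-join steps can be mimicked outside Hadoop. The sketch below uses Python with an in-memory SQLite database as a stand-in for Hive; the table names, column names, and dictionary entries are all assumptions, and the SQL is SQLite's dialect rather than HiveQL.

```python
import re
import sqlite3

tweets = [(1, "The Apple iPad sucks"),
          (2, "The new Google Nexus 7 is awesome!")]
dictionary = [("sucks", -1), ("awesome", 1)]  # hypothetical dictionary entries

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet_word (tweet_id INTEGER, word TEXT)")
conn.execute("CREATE TABLE sentiment (word TEXT PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO sentiment VALUES (?, ?)", dictionary)

# "Explode": one (tweet_id, word) row per word in each Tweet.
exploded = [(tweet_id, word)
            for tweet_id, text in tweets
            for word in re.findall(r"[a-z0-9']+", text.lower())]
conn.executemany("INSERT INTO tweet_word VALUES (?, ?)", exploded)

# Join on the word and add up the sentiment values per Tweet.
# Words missing from the dictionary join to NULL and add nothing.
rows = conn.execute("""
    SELECT tw.tweet_id, COALESCE(SUM(s.value), 0) AS score
    FROM tweet_word AS tw
    LEFT JOIN sentiment AS s ON s.word = tw.word
    GROUP BY tw.tweet_id
    ORDER BY tw.tweet_id
""").fetchall()
conn.close()
```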
SAP HANA

In-Memory, Column-Store database

Loads all data into main memory


Analyze billions of rows with sub-second response time
Column-store table structure

Allows for much better compression and parallelization than row-store

Used for real-time analytics

Available as an on-premises appliance or a cloud-based VM
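
One way to see why a column layout compresses well: values within a single column tend to repeat, so even a simple scheme such as run-length encoding shrinks them. The sketch below is a generic illustration with made-up rows, not a description of how SAP HANA compresses data internally.

```python
from itertools import groupby

# Made-up (company, language) rows, stored row by row.
rows = [("Apple", "en"), ("Apple", "en"), ("Apple", "fr"),
        ("Google", "en"), ("Google", "en")]

def to_columns(rows):
    """Pivot row-store tuples into one list per column."""
    return [list(column) for column in zip(*rows)]

def run_length_encode(column):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

company, lang = to_columns(rows)
encoded = run_length_encode(company)  # five values become two pairs
```

Each column can also be scanned or aggregated on its own, which is why column-stores split work across cores so easily.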
Why is SAP HANA Awesome?

Column-stores are naturally very good at parallelization

In-Memory means no waiting on disk I/O; main memory is still hundreds of times
faster than SSD

Feature rich


Text analytics

Predictive Analysis Library (PAL)

Application Server

It is an actual Database and does everything a database does
Demo Time
SAP HANA Sentiment Analysis

Sentiment is calculated when creating a full-text index on the text of the
tweet

Creates a sentiment value for each tweet

Analyze by many different dimensions

Aggregate sentiment by hour

Demo Time!
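
In the demo the aggregation happens inside SAP HANA. As a stand-alone illustration of "aggregate sentiment by hour", here is a plain-Python sketch that buckets scored Tweets by the hour of their `created_at` timestamp. The function names and the sample scores are made up.

```python
from collections import defaultdict
from datetime import datetime

def hour_bucket(created_at):
    """Truncate a Twitter created_at string to its hour."""
    stamp = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return stamp.strftime("%Y-%m-%d %H:00")

def average_by_hour(scored):
    """Average the sentiment of (created_at, score) pairs per hour."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum of scores, count]
    for created_at, score in scored:
        bucket = totals[hour_bucket(created_at)]
        bucket[0] += score
        bucket[1] += 1
    return {hour: total / count
            for hour, (total, count) in sorted(totals.items())}

# Made-up sentiment scores attached to three timestamps.
scored = [("Thu May 02 19:39:29 +0000 2013", 1),
          ("Thu May 02 19:50:00 +0000 2013", -1),
          ("Thu May 02 20:05:00 +0000 2013", 1)]
hourly = average_by_hour(scored)
```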
Other Text Analysis Options

Python Natural Language Toolkit


Analyze parts of speech and context
Should be possible to integrate with Hadoop (The Google did not help)
Other Big Data Problems

A GE Engine on a transatlantic flight generates 2TB of sensor data

There are four engines on a 747

What does the LHC at CERN do with the 15 petabytes of data it creates
annually?

How does the NSA store a yottabyte of data?

How does a small online gaming company analyze their customer base to
increase retention and margins?
How is Sentiment Analysis Being Used?

Companies ingest their social media feeds into these systems

If a Tweet or Facebook post meets certain criteria, an automated or human
response can be requested
Hot vs. Cold Data



Hot data is the recent data you are most interested in

Keep this data in SAP HANA for real-time processing

Archive it after a period of time: 1 month, 3 months, 6 months, etc…
Cold Data is your historical data

Data warehouses that can handle massive volumes of data are EXPENSIVE!!!!

Use Hadoop and Hive as your data warehouse

The only cost is the hardware
Still able to analyze cold data, store it cheaply, and integrate with SAP HANA