The World Cup 2014 Sentiment Analysis

Master of Information Technology
COMP90055 Computing Project
Credit point: 25
Project Report
The World Cup 2014 Sentiment Analysis
Zeyu Zhao
Student number: 552096
Supervisor: Richard Sinnott
I declare that
- this assignment is my own work and does not involve plagiarism or collusion. I also
declare that the material contained in this assignment has not previously been
submitted for assessment in any other formal course of study.
- the thesis is 2844 words in length (excluding text in images, tables and bibliographies).
TABLE OF CONTENTS
Abstract
I. Introduction
  A. Key techniques
    a. Twitter API
    b. Stanford CoreNLP API
    c. Weka
    d. Google Map JavaScript API
    e. HighCharts API
II. System Design
  A. System Architecture
    a. NeCTAR Research Cloud
    b. Instances and Volumes
    c. Harvest and Analysis System
  B. Twitter harvesting
    a. Streaming API Harvester
    b. Search API Harvester
  C. Database Design
  D. Web interface
  E. Data Analysis
    a. Sentiment analysis
    b. Prediction and actual event
III. System Implementation
  A. The environment of implementation
  B. Twitter Harvester
IV. Data analysis results
  A. Total tweets harvested
  B. Sentiment analysis in Australia and five major cities
  C. Actual events and goal prediction
  D. Sentiment geographical distribution analysis
  E. Top 20 Hashtags
  F. Sentiment tendency during World Cup
  G. Language analysis of World Cup
V. Conclusion
Reference
ABSTRACT
This project focuses on the development of algorithms for predicting actual events
from social media, with an emphasis on sentiment analysis. Specifically, the work
focuses on a detailed sentiment analysis of FIFA 2014 World Cup tweets collected
across the major Australian cities (Melbourne, Sydney, Brisbane, Adelaide and
Perth). The work encompasses harvesting tweets for every World Cup team from each
Australian city and classifying their sentiment as positive or negative. This is
used to identify geospatial clusters of nationalities and locations of Australian
sports fans. The work then focuses on a detailed analysis of actual events (goals
scored, penalties awarded, red cards given etc.) in the World Cup using Twitter
sentiment analysis. Algorithms are then developed to predict events and
subsequently establish the correlation of the predicted events with the events that
actually took place. Natural language processing approaches are explored in this
context. The project is developed and provisioned through Cloud infrastructure on
the National eResearch Collaboration Tools and Resources (NeCTAR,
www.nectar.org.au) Research Cloud.
I. INTRODUCTION
In recent years, Twitter analysis has become more popular with the broad
uptake of Twitter. A great deal of information is now available through the millions
of tweets posted each day. In 2014, the World Cup in Brazil was one of the most
widely tweeted sports events in the world. The World Cup started on June 12th and
ended on July 13th. During this period, over 672 million tweets related to the
hashtag “#WorldCup” were sent all over the world, according to Rogers (2014). In the
major cities of Australia, there were millions of tweets about the World Cup. This
information can be analysed to learn about the World Cup and about fans and the
population more generally across Australia.
The aim of this paper is to explore sentiment analysis of Twitter data related to
the World Cup from tweets in Australia, and to develop algorithms that can be used
to find the correlation between actual events and public tweets. This paper begins
with a description of the system architecture. It then focuses on the details of the
components and approaches used in the development and delivery of the system.
This includes the approaches used to harvest tweets, estimate the sentiment of
tweets and ultimately predict goals. Next, it introduces the requirements for
implementing this system and the visualization of the results. Finally, the paper
summarises the findings regarding the Australian World Cup tweets and the
correlation between actual events and public tweets.
A. Key techniques
a. Twitter API
In this project, the Twitter API is used to collect tweets. Users can connect to Twitter
by making individual requests or by keeping a persistent HTTP connection open
through the Streaming API.
b. Stanford CoreNLP API
The Stanford CoreNLP API provides a number of functions to deal with natural
language processing. In this project, we use the sentiment analysis part of the
Stanford CoreNLP API.
c. Weka
The Weka software consists of a set of visualization tools and algorithms for data
analysis and predictive modelling. In this project, Weka is used for goal prediction
based on the pre-processed data in five major cities of Australia.
d. Google Map JavaScript API
Google Maps is used to show the geographic distribution of results. A heat map
is created to show the geographical distribution of tweets.
e. HighCharts API
The HighCharts API is used to present the sentiment analysis through line graphs and
bar charts. Actual events can be displayed on a line graph using corresponding
symbols, which presents the data more comprehensively.
II. SYSTEM DESIGN
A. System Architecture
a. NeCTAR Research Cloud
In this project, the NeCTAR Research Cloud’s infrastructure is used to
implement the Twitter harvesters, the data store, sentiment analysis, goal prediction
and the user interface. On the NeCTAR Research Cloud, instances and volumes can be
set up through the NeCTAR web interface, and data can be transferred to instances
conveniently by mounting volumes to instances.
b. Instances and Volumes
This system consists of 7 instances and 1 volume. Five instances are used
for harvesting and analysing tweets, one for each major Australian city, each with
1 core and 8GB RAM. The 100GB volume is mounted to a “Database” instance and is
used to store all the JSON documents of tweets. Map-Reduce views of documents and
goal predictions are displayed in the client browser through the web interface,
which is deployed on the “Web” instance.
c. Harvest and Analysis System
The system architecture can be divided into two parts: the harvesting system
and the analysis system. As shown in Figure 1, one Twitter harvester maintains a
streaming connection between the Twitter server and CouchDB, and five Twitter
harvesters using the Search API are deployed, one for each major city of Australia.
These five harvesters run on different virtual machines, where they collect
relevant tweets and store them in CouchDB.
Figure 1. Harvest System Architecture
Another key part of the system is the analysis system, which is shown in
Figure 2. In CouchDB, different map-reduce views were created for different
scenarios. Additionally, a sentiment analyzer is used to analyse tweets and
attach a sentiment score to every tweet (JSON document). Actual event
predictors are used to pre-process data for Weka to predict when goals happen
during matches. In order to display all the analyses and predictions, a web browser
interface is deployed using HighCharts and the Google Maps JavaScript API. In the
web interface, raw data (JSON documents) from CouchDB are made available
through Ajax. The web interface uses synchronous rather than asynchronous
requests, so that the transfer of data from CouchDB completes before the data is
rendered in the browser.
Figure 2. Analysis System Architecture
B. Twitter harvesting
There are two kinds of Twitter API: the REST APIs and the Streaming APIs. The
REST API allows developers to read and write Twitter data, and the Search API is
part of the REST API. The Streaming API allows developers to access Twitter’s global
stream of Tweet data with low latency. Both the Search API and the Streaming API
are used in this project in order to harvest more tweets. The Streaming API
harvester focuses on harvesting tweets in real time from users who send relevant
tweets, whilst the Search API harvesters focus on harvesting those users’ previous
relevant tweets.
a. Streaming API Harvester
To collect relevant tweets from the five major cities of Australia, there are two
factors that influence the number of collected tweets. One is the tweet’s location;
the other is the keywords mentioned in the tweet. At first, the streaming filter was
used to harvest tweets from specific locations with a given bounding box, and each
tweet was then checked for a keyword (such as “worldcup”, “brazil2014”, “soccer”
etc.). However, this method did not harvest tweets efficiently, because only a few
of the tweets that have coordinates inside the bounding box contain keywords. On
the other hand, the Streaming API also makes it possible to track a number of
keywords and then filter the returned tweets based on the user location property.
This approach was used to harvest more tweets.
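The keyword and user-location checks described above can be sketched as follows. This is only an illustrative Python sketch, not the project's Java harvester: the function names are hypothetical, the keyword list contains only the three examples quoted in the text, and simple substring matching is an assumption.

```python
# Keywords quoted in the text; the project's full list is not given.
KEYWORDS = ("worldcup", "brazil2014", "soccer")
CITIES = ("Melbourne", "Sydney", "Brisbane", "Adelaide", "Perth")


def has_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the tweet text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def match_city(user_location, cities=CITIES):
    """Return the first major city named in the user's profile location, or None."""
    lowered = (user_location or "").lower()
    for city in cities:
        if city.lower() in lowered:
            return city
    return None


def classify(tweet):
    """Return the city whose database should receive the tweet, or None to discard it."""
    if not has_keyword(tweet.get("text", "")):
        return None
    return match_city(tweet.get("user", {}).get("location"))
```

A tweet that mentions a keyword but whose user location names none of the five cities is discarded in this sketch, as is a tweet from one of the cities that mentions no keyword.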
When relevant tweets are recognized, the program establishes a connection
with CouchDB and opens the databases for the five major cities, creating any that
do not already exist. Before saving relevant tweets into CouchDB, it is necessary to
insert the unique id of the tweet into the tweet JSON object as the document id. In
this way, duplicates can be avoided in the database, since CouchDB cannot save two
documents with the same id.
The flow chart of the streaming harvester is shown in Figure 3.
Figure 3. Flow chart of streaming harvester
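The de-duplication step can be illustrated with a small Python sketch. It is not the project's lightcouch code: `to_couch_doc` and `InMemoryStore` are hypothetical names, and the store is only a stand-in that mimics CouchDB rejecting a second document with an existing `_id`. The `id_str` field is the string form of the tweet id in the tweet JSON.

```python
def to_couch_doc(tweet):
    """Copy the tweet and use its unique id as the CouchDB document id."""
    doc = dict(tweet)
    doc["_id"] = tweet["id_str"]
    return doc


class InMemoryStore:
    """Stand-in for one city database: refuses a second document with the same _id."""

    def __init__(self):
        self.docs = {}

    def save(self, doc):
        if doc["_id"] in self.docs:
            # A real CouchDB server would answer with a 409 Conflict here.
            return False
        self.docs[doc["_id"]] = doc
        return True
```

Saving the same tweet twice through `save(to_couch_doc(tweet))` stores it once; the second call is rejected.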
b. Search API Harvester
Because the World Cup in Brazil had finished, the Search API was a more
effective way to collect old tweets than the Streaming API.
After the Streaming API harvester had harvested a number of tweets, a map-reduce
view was used in each database to list the user id of each tweet, together with the
ids of replied-to users and mentioned users. This expands the range of relevant
users, which benefits the subsequent processing. “GET statuses/user_timeline” is
used to harvest previous tweets in the Search API harvester, recording the “max_id”
of every timeline page to page through the timeline efficiently and avoid reading a
tweet more than once. In addition, each harvested tweet is checked as to whether it
contains a keyword and whether its user location contains the name of one of the
five major cities before the tweet is inserted into that city’s database.
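The “max_id” paging described above can be sketched in Python as follows. The sketch makes no real network calls: `fetch_page` is a hypothetical callable standing in for “GET statuses/user_timeline”, and `make_fake_fetch` is a hypothetical in-memory timeline used only to exercise the loop.

```python
def harvest_timeline(fetch_page, page_size=200):
    """Collect a timeline page by page, newest first, without re-reading tweets.

    fetch_page(max_id, count) is a hypothetical callable returning a list of
    tweet dicts with id <= max_id (or the newest tweets when max_id is None).
    """
    collected = []
    max_id = None
    while True:
        page = fetch_page(max_id, page_size)
        if not page:
            break
        collected.extend(page)
        # Next, ask only for tweets strictly older than the oldest one seen.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected


def make_fake_fetch(ids):
    """Build an in-memory timeline that honours max_id and count (for illustration)."""
    ordered = sorted(ids, reverse=True)

    def fetch(max_id, count):
        eligible = [i for i in ordered if max_id is None or i <= max_id]
        return [{"id": i} for i in eligible[:count]]

    return fetch
```

Subtracting one from the lowest id on each page is what prevents the oldest tweet of one page from reappearing as the first tweet of the next.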
C. Database Design
CouchDB is deployed as the database server in this project. CouchDB is a
document-oriented database. Since harvested tweets and CouchDB documents are both
JSON objects, CouchDB is well suited to storing tweets. Compared with relational
databases, CouchDB makes reading and writing data easier, at the cost of not
enforcing strict data consistency.
The attached volume on the “Database” instance is used to store tweets and
view index files. In this project, several views were created to compute the tweet
counts (positive, negative, neutral and total), the tweets per minute for each
match from kick-off to 3 hours later, the hashtags used during the World Cup, the
tweet counts by language and user sentiment, the users mentioned in tweets, and
the tweet coordinates.
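As an illustration of the tweets-per-minute view, the sketch below emulates the map and reduce steps in Python. CouchDB views themselves are written as JavaScript functions, so this is not the project's view code; the function names and the 180-minute window parameter are assumptions, while `created_at` is the timestamp field of the tweet JSON.

```python
from collections import Counter
from datetime import datetime, timezone

CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # e.g. "Tue Jul 08 20:00:30 +0000 2014"


def map_minute(tweet, kickoff, window_minutes=180):
    """Map step: emit (minutes since kick-off, 1) for tweets inside the window."""
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    created = created.astimezone(timezone.utc)
    offset = int((created - kickoff).total_seconds() // 60)
    if 0 <= offset < window_minutes:
        yield offset, 1


def tweets_per_minute(tweets, kickoff, window_minutes=180):
    """Reduce step: sum the emitted values per minute key."""
    counts = Counter()
    for tweet in tweets:
        for minute, value in map_minute(tweet, kickoff, window_minutes):
            counts[minute] += value
    return dict(counts)
```

Here `kickoff` must be a timezone-aware `datetime`. Minutes in which no tweet was sent do not appear in the result at all.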
D. Web interface
The web interface uses jQuery UI to build the user interface and jQuery AJAX to
communicate with CouchDB. The actual event data, the per-minute tweet data during
matches and the predictions are pre-processed, and the results are stored in the
“web” database of CouchDB. Because the map-reduce output occasionally contains no
tweets for some time slots, the pre-processing adds “0” for those minutes, which
allows the web interface to read the data without extra processing.
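The zero-filling step can be sketched in a few lines of Python. The function name and the 180-minute default (kick-off to 3 hours later) are assumptions made for illustration.

```python
def fill_missing_minutes(counts, window_minutes=180):
    """Return one count per minute of the window, using 0 where no tweet was seen."""
    return [counts.get(minute, 0) for minute in range(window_minutes)]
```

The result is a dense list indexed by minute, which a charting library can plot directly as a series.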
E. Data Analysis
a. Sentiment analysis
In order to estimate the sentiment of each tweet, the Stanford CoreNLP
models were used in this project. Because the CoreNLP sentiment model only works
on sentences and a tweet may consist of several sentences, it is necessary to split a
tweet into sentences. CoreNLP can then predict the sentiment score for each
sentence.
There are several ways to compute the tweet-level score by applying
functions to the sentence-level scores. One approach is to average the sentiment
scores of all the sentences in the tweet and use the average as the final sentiment
score for the tweet. Another approach uses the sentiment score of the longest
sentence as the tweet sentiment score. This project uses the latter, on the
assumption that the main idea of a tweet is expressed by its longest sentence. Once
the sentiment score has been obtained, the tweet JSON object is updated with a new
property named “sentiment_analysis”, which contains the sentiment score of the
particular tweet.
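The two aggregation strategies can be compared with a short Python sketch. It assumes the sentence splitting and scoring have already been done, and that each sentence score is an integer class from 0 (very negative) to 4 (very positive); both the function names and this score range are assumptions made for illustration.

```python
def average_score(sentence_scores):
    """Tweet score as the mean of all sentence scores."""
    scores = [score for _, score in sentence_scores]
    return sum(scores) / len(scores)


def longest_sentence_score(sentence_scores):
    """Tweet score as the score of the longest sentence (first one wins on ties)."""
    _, score = max(sentence_scores, key=lambda pair: len(pair[0]))
    return score


def annotate(tweet, sentence_scores):
    """Return a copy of the tweet with the 'sentiment_analysis' property added."""
    annotated = dict(tweet)
    annotated["sentiment_analysis"] = longest_sentence_score(sentence_scores)
    return annotated
```

For a tweet made of a short enthusiastic sentence and a long critical one, the two strategies disagree: the average lands in the middle, while the longest-sentence rule takes the critical score.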
b. Prediction and actual event
Based on the FIFA official website, all the actual event information is stored as a
JSON document that covers 64 matches, each played between two teams. Each match
has a first half and a second half, extra time where it was played, and the times of
the actual events (such as substitutions, yellow cards, goals scored etc.) that
happened in the match. Furthermore, all of the actual events are shown on line
graphs as a timeline of the match.
Through the web interface, it can be observed that the number of tweets increases
dramatically when a goal is scored. Predicting goals (including goals scored,
penalties scored and own goals) based on the trend of tweet counts is therefore
possible.
In total there were 171 goals during the World Cup (including penalties scored and
own goals), and more than 90% of these goals happened before a spike (an increase in
the tweet count per minute caused by a goal event) in the five cities’ data.
Identifying these specific spikes is therefore very important for the prediction of
actual events.
Based on the previous description, several attributes were computed during the data
pre-processing: the rates of increase of a spike (the high point of the tweet count
compared to the low point, and the high point compared to the average tweet count);
whether the low point of the tweet count is less than the average tweet count;
whether the low point is twice as large as the average tweet count; whether the high
point is more than the average tweet count; the minutes between the low point and
the high point; and the number of goals before a spike.
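A possible reading of these attributes is sketched below in Python. The exact definitions used in the project are not given, so the ratios, the comparison operators and every name here are assumptions; the number of goals before a spike is the value to be predicted and is therefore not computed.

```python
def spike_attributes(counts, low_idx, high_idx):
    """Describe one candidate spike in a per-minute tweet count series.

    counts is the dense per-minute series for a match; low_idx and high_idx
    are the positions of the low point and the following high point.
    """
    low, high = counts[low_idx], counts[high_idx]
    average = sum(counts) / len(counts)
    return {
        "rise_vs_low": high / low if low else float("inf"),
        "rise_vs_average": high / average if average else float("inf"),
        "low_below_average": low < average,
        "low_twice_average": low >= 2 * average,
        "high_above_average": high > average,
        "interval_minutes": high_idx - low_idx,
    }
```

For the series `[10, 10, 5, 40, 15]` with the low point at index 2 and the high point at index 3, the average is 16, so the spike rises to 8 times its low point and 2.5 times the average within one minute.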
However, it was observed that it is better to use only the data from matches where
the tweet count is more than 1500. Otherwise, increases in tweets from a low to a
high point (or vice versa) will be unreliable, e.g. due to random noise in the
Twitter data. In addition, the attribute Interval minute (the minutes between the
low point and the high point) is used to predict the number of goals before a spike.
After data pre-processing, goals are predicted by applying the J48 decision tree in
Weka to each city’s data, and each city’s prediction is stored as a JSON document in
the “web” database of CouchDB.
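Weka's native input format is ARFF, so one way to hand the pre-processed attributes to J48 is to write them to an ARFF file. The report does not state which input format was used, so the Python sketch below is only an assumption, and the relation and attribute names in the usage note are hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, type) pairs, where type is either
    'numeric' or a nominal specification such as '{true,false}'.
    rows is a list of value sequences in the same order as attributes.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        lines.append("@attribute {} {}".format(name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"
```

For example, `to_arff("spikes", [("rise_vs_low", "numeric"), ("low_below_average", "{true,false}"), ("goals_before_spike", "{0,1,2}")], [(8.0, "true", 1)])` produces a file with three attribute declarations and one data row. The class attribute is declared as nominal because J48 is a classifier.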
III. SYSTEM IMPLEMENTATION
A. The environment of implementation
For this system, the environment of implementation is shown in table 1.
Environment                  Description
Cloud                        NeCTAR
Programming language         Java 1.7, JavaScript
Database                     CouchDB
Data analytic technique      Google Map JavaScript API, HighCharts API, Stanford CoreNLP API, Weka
Twitter harvester technique  Twitter Streaming API, Twitter Search API
Table 1. The system implementation environment
B. Twitter Harvester
Both the Search API harvester and the Streaming API harvester use the twitter4j
library for harvesting relevant tweets. In order to store those tweets (JSON
documents) in CouchDB, the lightcouch library is used. When the lightcouch library
is used on Windows, the text file encoding setting in Eclipse should be set to
“UTF-8”; otherwise the program cannot run correctly on Windows because of
unreadable characters, since Eclipse inherits the system default encoding, which
was GBK in this case.
It is necessary to use Application-only Authentication for the Search API
harvester, since it allows 300 “GET statuses/user_timeline” requests (the main
request used by the Search API harvester in this project) in each 15-minute
interval, whereas only 180 requests per 15-minute interval can be made without
Application-only Authentication. Furthermore, the request “GET
application/rate_limit_status” is called regularly so that the program avoids
exceeding the rate limit.
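The effect of the two limits can be made concrete with a small Python sketch. The function names are hypothetical; `remaining` and `reset_epoch` stand for the remaining-request count and the window reset time (in epoch seconds) that a rate limit status response reports for an endpoint.

```python
WINDOW_SECONDS = 15 * 60


def min_interval(requests_per_window, window_seconds=WINDOW_SECONDS):
    """Average spacing between requests that exactly uses up the window's quota."""
    return window_seconds / requests_per_window


def seconds_to_wait(remaining, reset_epoch, now_epoch):
    """Pause needed before the next request, given the reported quota state."""
    if remaining > 0:
        return 0.0
    return max(0.0, float(reset_epoch - now_epoch))
```

With 300 requests per window the harvester can issue one timeline request every 3 seconds on average, compared with one every 5 seconds at 180 requests per window.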
IV. DATA ANALYSIS RESULTS
The results of data analysis are demonstrated on the website:
http://115.146.93.143:8080/TwitterAnalysis/index.html.
A. Total tweets harvested
The number of harvested tweets is shown in Table 2. This is also shown on the
“Overview” page of the website. The number of harvested tweets in Australia would be
expected to equal the sum of the tweet counts of the five major cities, but it does
not, because some tweets are repeated in different databases. These repeats are
caused by users who changed their user location during harvesting.
Location    Total Count   Count during World Cup
Adelaide    422243        68910
Brisbane    741138        122003
Melbourne   1332894       256650
Perth       474869        75288
Sydney      1487227       297578
Australia   4380302       805203
Table 2. The number of tweets harvested from five major cities and Australia
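The size of the overlap between city databases follows directly from Table 2. The Python sketch below only redoes that arithmetic with the published counts; the variable names are illustrative.

```python
# Counts copied from Table 2.
total_count = {
    "Adelaide": 422243,
    "Brisbane": 741138,
    "Melbourne": 1332894,
    "Perth": 474869,
    "Sydney": 1487227,
}
world_cup_count = {
    "Adelaide": 68910,
    "Brisbane": 122003,
    "Melbourne": 256650,
    "Perth": 75288,
    "Sydney": 297578,
}
australia_total = 4380302
australia_world_cup = 805203

# Surplus = sum over the five city databases minus the Australia-wide count.
surplus_total = sum(total_count.values()) - australia_total
surplus_world_cup = sum(world_cup_count.values()) - australia_world_cup

print(surplus_total)      # 78069
print(surplus_world_cup)  # 15226
```

The five city counts add up to 4,458,371 tweets, which is 78,069 more than the Australia-wide count; for the World Cup period the surplus is 15,226. Both are under 2% of the respective Australia-wide counts.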
B. Sentiment analysis in Australia and five major cities
Location    Positive   Negative   Neutral
Adelaide    46.6%      36.9%      16.5%
Brisbane    47.2%      37.7%      15.1%
Melbourne   48.1%      35.6%      16.3%
Perth       46.3%      38%        15.7%
Sydney      46.2%      38%        15.9%
Australia   46.9%      37.1%      16%
Table 3. The percentage of sentiment analysis
According to Table 3, there are more positive tweets than negative ones in each of
the five major cities, and nearly half of the tweets are positive, which may indicate
satisfaction with the World Cup in Brazil.
In terms of sentiment analysis for each match, the match between Colombia and
Greece is of interest: in Melbourne it has more negative tweets (42.2%) than positive
tweets (39.5%). By contrast, in the match between Belgium and USA the proportion of
positive tweets (58.7%) is more than twice the proportion of negative tweets
(24.7%).
C. Actual events and goal prediction
Figure 4. The tendency of the number of tweets per minute during the match Brazil
vs Germany
The results of goal prediction and the changing number of tweets during the game
are shown here. For the Australia-wide data, the predictor identifies around 69% of
the spikes that contain a goal (or goals), but it identifies only approximately 29%
of such spikes using just the Adelaide data. In addition, the percentages of
correctly identified spikes are around 34%, 50%, 34% and 51% in Brisbane, Melbourne,
Perth and Sydney respectively. In short, more data gives more accurate results.
Figure 4 shows the detailed analysis of the match between Brazil and Germany with
sentiment analysis. The x axis shows the timeline of the match from kick-off to
3 hours later, with the actual soccer events displayed on it. The explanation of
each symbol is shown above the line graph. After the “predict goals” button is
selected, the prediction results are shown on the blue line, which depicts the tweet
count per minute. In this line graph, all of the actual goals that happened before a
spike are predicted, which indicates that actual soccer events are positively
correlated with significant increases in the number of tweets during matches.
D. Sentiment geographical distribution analysis
Figure 5 shows the sentiment distribution of tweets with coordinates across Australia.
The left map shows the positive tweet distribution and the right one the negative
tweet distribution. In addition, five markers are located on the five major cities.
Figure 5. Sentiment geographical distribution in Australia
The distribution of positive tweets is similar to the distribution of negative tweets.
However, there are still some differences in smaller areas. For example, Figure 6
shows that there are more positive tweets than negative tweets around Southgate
Centre in Melbourne.
Figure 6. Sentiment geographical distribution in Melbourne
E. Top 20 Hashtags
Table 4 lists the most popular hashtags in the Australian tweets. As can be
seen, Australians paid particular attention to their own soccer team, because
“#GoSocceroos” and “#Socceroos” occur in large numbers of tweets.
Rank   Hashtag          Count
1      #WorldCup        82419
2      #auspol          70821
3      #WorldCup2014    50793
4      #GoSocceroos     34798
5      #SBSWorldCup     25034
6      #worldcup        16095
7      #GER             13807
8      #CHIAUS          10260
9      #AUSNED          9760
10     #BRA             9180
11     #ARG             9041
12     #NED             8102
13     #Socceroos       8055
14     #AUS             7979
15     #BRAGER          7414
16     #WorldCupFinal   6542
17     #GERARG          5770
18     #Melbourne       5689
19     #FifaWorldCup    4958
20     #Brazil2014      4833
Table 4. Top 20 hashtags in tweets
F. Sentiment tendency during World Cup
This scenario shows how the sentiment changed per day during the World Cup for
one team (implemented only for Australia and Germany). It uses Map-Reduce to
create a view of tweets that include the team’s Twitter screen name, the country name
and/or the abbreviation of the country name. Figure 8 shows that the sentiment
changes significantly when there is a match on that day. Furthermore, Australians
have more positive than negative views of the German World Cup team: for example,
there were many more positive tweets than negative tweets on Jul 8th, the day of
the semi-final between Brazil and Germany.
Figure 8. Germany team sentiment tendency during World Cup 2014
G. Language analysis of World Cup
This scenario is based on the “lang” property of the tweet JSON document. As
Figure 9 shows, Australian tweets are written in 46 languages and most
tweets are in English. In addition, there is also a sentiment analysis of the tweets
in each language. According to the analysis, around 88% of Japanese tweets are
negative, and apart from English, German is the only language with more positive
tweets (44.5%) than negative tweets. This suggests that Japanese-language tweeters
may be the most pessimistic about the World Cup.
Figure 9. Tweets languages in Australia
V. CONCLUSION
This paper presented a brief introduction to sentiment analysis of World Cup tweets
in Australia and to algorithms for goal prediction based on the tweets. The
collected data indicates that more Australian tweets are positive than negative,
despite the national team's early exit from the World Cup. The geographical
distributions of positive and negative tweets are broadly similar. Actual sports
events increase the number of tweets, and the analysis shows that most goals
happened before a spike; hence it is possible to identify game events from Twitter.
We identified that Japanese-language tweets were the most negative in the World Cup
2014. The system has provided an exploration of all matches and events in the World
Cup. We have shown that a tweet is far more than just a 140 character string: it can
be used to show national fervor, fan bases across Australia and also the most
sports-mad locations of Australia.
REFERENCE
Anderson, JC, Slater, N & Lehnardt, J 2009, CouchDB : the definitive guide,
O'Reilly Media, Inc., Sebastopol, Calif.
Belmonte, N (@philogb), 2014, The global conversation about the #WorldCup,
weblog post, July 10, Twitter, <https://blog.twitter.com/2014/the-globalconversation-about-the-worldcup>
Bloch, J 2008, Effective Java, Addison Wesley, Upper Saddle River, NJ.
Manning, CD, Surdeanu, M, Bauer, J, Finkel, J, Bethard, SJ & McClosky, D,
The Stanford CoreNLP Natural Language Processing Toolkit
Dakhode, V & Kolekar, S 2011, Development of a Weka Filter for
Reconstruction of Incomplete Data Sets using Attribute Relation Analysis,
Advances in Computational Sciences & Technology, vol. 4, p227-231
FIFA's Official Social Platforms, 2014, FIFA, <http://www.fifa.com/worldcup/>
Graham, M, Hale, SA & Gaffney, D 2013, Where in the World are You?
Geolocation and Language Identification in Twitter
Green, S, Chuang, J, Heer, J & Manning, CD 2014, Predictive translation
memory: a mixed-initiative system for human language translation,
Proceedings of the 27th annual ACM symposium on User interface software
and technology, pp. 177-187.
Machine Learning Group at the University of Waikato,
<http://www.cs.waikato.ac.nz/ml/weka/>
Rogers, S (@smfrogers), 2014, Insights into the #WorldCup conversation on
Twitter, weblog post, July 14, Twitter,
<https://blog.twitter.com/2014/insights-into-the-worldcup-conversation-ontwitter>
Shrader, W, 2012, GOOGLE HEAT MAPS, Cartographic Perspectives, no. 72,
pp. 85-89
Twitter, <https://dev.twitter.com/overview/documentation>
World Cup in JSON, <http://worldcup.sfg.io/matches>