Master of Information Technology
COMP90055 Computing Project
Credit points: 25

Project Report

The World Cup 2014 Sentiment Analysis

Zeyu Zhao
Student number: 552096
Supervisor: Richard Sinnott

I declare that:
- this assignment is my own work and does not involve plagiarism or collusion. I also declare that the material contained in this assignment has not previously been submitted for assessment in any other formal course of study.
- the thesis is 2844 words in length (excluding text in images, tables and bibliographies).

TABLE OF CONTENTS

Abstract
I. Introduction
   A. Key techniques
      a. Twitter API
      b. Stanford CoreNLP API
      c. Weka
      d. Google Map JavaScript API
      e. HighCharts API
II. System Design
   A. System Architecture
      a. NeCTAR Research Cloud
      b. Instances and Volumes
      c. Harvest and Analysis System
   B. Twitter harvesting
      a. Streaming API Harvester
      b. Search API Harvester
   C. Database Design
   D. Web interface
   E. Data Analysis
      a. Sentiment analysis
      b. Prediction and actual event
III. System Implementation
   A. The environment of implementation
   B. Twitter Harvester
IV. Data analysis results
   A. Total tweets harvested
   B. Sentiment analysis in Australia and five major cities
   C. Actual events and goal prediction
   D. Sentiment geographical distribution analysis
   E. Top 20 Hashtags
   F. Sentiment tendency during World Cup
   G. Language analysis of World Cup
V. Conclusion
Reference

ABSTRACT

This project focuses on the development of algorithms for predicting actual events using social media, with an emphasis on sentiment analysis. Specifically, the work focuses on a detailed sentiment analysis of FIFA 2014 World Cup tweets collected across the major Australian cities (Melbourne, Sydney, Brisbane, Adelaide and Perth). The work encompasses harvesting tweets for every World Cup team from each Australian city and focuses on positive and negative sentiment analysis. This is used to identify geospatial clusters of nationalities and locations of Australian sports fans. The work then focuses on the detailed analysis of actual events (goals scored, penalties awarded, red cards given etc.) in the World Cup using Twitter sentiment analysis. Algorithms are then developed to predict events and subsequently establish the correlation of predicted events with the actual events that took place. Natural language processing approaches are explored in this context. The project is developed and provisioned through Cloud infrastructure on the National eResearch Collaboration Tools and Resources (NeCTAR, www.nectar.org.au) Research Cloud.

I. INTRODUCTION

In recent years, Twitter analysis has become more popular with the broad uptake of Twitter. A great deal of information is now available through the millions of tweets that are posted each day.
In 2014, the Brazil World Cup was one of the most prominent sports events to be tweeted about globally. The World Cup started on June 12th and ended on July 13th. During this period, over 672 million tweets related to the hashtag “#WorldCup” were sent all over the world, according to Rogers (2014). In the major cities of Australia, there were millions of tweets about the World Cup. This information can be analysed with regard to the World Cup, its fans and the population more generally across Australia. The aim of this paper is to explore sentiment analysis of Twitter data related to the World Cup from tweets in Australia, and to develop algorithms that can be used to find the correlation between actual events and public tweets.

This paper begins with a description of the system architecture. It then focuses on the details of the components and approaches used in the development and delivery of the system. This includes the approaches used to harvest tweets, estimate the sentiment of tweets and ultimately predict goals. Next, it introduces the requirements for implementing this system and the visualization of the results. Finally, the paper sums up the information regarding the Australian World Cup tweets and the correlation between actual events and public tweets.

A. Key techniques

a. Twitter API

In this project, the Twitter API is used to collect tweets. Clients connect to Twitter either by making individual requests or by holding a persistent HTTP connection open through the Streaming API.

b. Stanford CoreNLP API

The Stanford CoreNLP API provides a number of functions for natural language processing. In this project, we use the sentiment analysis part of the Stanford CoreNLP API.

c. Weka

The Weka software consists of a set of visualization tools and algorithms for data analysis and predictive modelling. In this project, Weka is used for goal prediction based on the pre-processed data from the five major cities of Australia.

d. Google Map JavaScript API

Google Maps is used to show the geographic distribution of results. A heat map is created to show the geographical distribution of tweets.

e. HighCharts API

The HighCharts API is used to present the sentiment analysis through line graphs and bar charts. Actual events can be displayed on a line graph using corresponding symbols, which presents the data more comprehensively.

II. SYSTEM DESIGN

A. System Architecture

a. NeCTAR Research Cloud

In this project, the NeCTAR Research Cloud infrastructure is used to implement the Twitter harvesting, data storage, sentiment analysis, goal prediction and the user interface. On the NeCTAR Research Cloud, instances and volumes can be set up through the NeCTAR web interface, and data can be transferred to instances conveniently by mounting volumes on them.

b. Instances and Volumes

This system consists of 7 instances and 1 volume. Five instances are used for harvesting and analysing tweets, one for each major Australian city, each with 1 core and 8GB RAM. The 100GB volume is mounted on a “Database” instance and is used to store all the JSON documents of tweets. The map-reduce views of the documents and the goal predictions are displayed in the client browser through the web application deployed on the “Web” instance.

c. Harvest and Analysis System

The system architecture can be divided into two parts: the harvesting system and the analysis system. As shown in Figure 1, one Twitter harvester maintains a single streaming connection between the Twitter server and CouchDB, and five Twitter harvesters using the Search API serve the five major cities of Australia, one per city. These five harvesters are deployed on different virtual machines, which collect relevant tweets and store them in CouchDB.

Figure 1. Harvest System Architecture

Another key part of the system is the analysis system, which is shown in figure 2. In CouchDB, different map-reduce views were created for different scenarios.
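To illustrate what one such view computes, the following Python sketch mirrors the map and reduce steps of the tweets-per-minute view described later in this report (counts from kick-off to 3 hours later, with empty minutes filled with 0). It is only a sketch under stated assumptions: real CouchDB views are written as JavaScript functions, and the function names, the `WINDOW_MINUTES` constant and the sample kick-off time are hypothetical. The only fields taken from Twitter are `created_at` and its timestamp format.

```python
from collections import Counter
from datetime import datetime, timezone

# Timestamp format used in the "created_at" field of tweet JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
# The report counts tweets from kick-off to 3 hours later.
WINDOW_MINUTES = 180


def map_minute(tweet, kickoff):
    """Map step: emit (minute offset, 1) for a tweet inside the match window."""
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    offset = int((created - kickoff).total_seconds() // 60)
    if 0 <= offset < WINDOW_MINUTES:
        yield offset, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def tweets_per_minute(tweets, kickoff):
    """Return one count per minute, with 0 for minutes that have no tweets."""
    pairs = (pair for tweet in tweets for pair in map_minute(tweet, kickoff))
    totals = reduce_counts(pairs)
    return [totals.get(minute, 0) for minute in range(WINDOW_MINUTES)]
```

Filling the gaps at this stage means the charting code can treat the result as a plain fixed-length series.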
Additionally, a sentiment analyzer is used to analyse tweets and attach a sentiment score to every tweet (JSON document). Actual event predictors are used to pre-process data for Weka to predict when goals happen during matches. In order to display all the analyses and predictions, a web browser interface is deployed using HighCharts and the Google Maps JavaScript API. In the web interface, raw data (JSON documents) from CouchDB are made available through Ajax. The web interface issues these requests synchronously rather than asynchronously, so that each data transfer from CouchDB completes before the data are delivered to the browser.

Figure 2. Analysis System Architecture

B. Twitter harvesting

There are two kinds of Twitter API: the REST API and the Streaming API. The REST API allows developers to read and write Twitter data, and the Search API is part of the REST API. The Streaming API allows developers to access Twitter’s global stream of tweet data with low latency. Both the Search API and the Streaming API are used in this project in order to harvest more tweets. The Streaming API harvester focuses on harvesting tweets, in real time, from users who are sending relevant tweets, whilst the Search API harvesters focus on harvesting those users’ earlier relevant tweets.

a. Streaming API Harvester

Two factors influence the number of relevant tweets that can be collected from the five major cities of Australia: the tweet’s location and the keywords mentioned in the tweet. The streaming filter was first used to harvest tweets from specific locations with a given bounding box, and each tweet was then checked for a keyword (such as “worldcup”, “brazil2014”, “soccer” etc.). However, this method did not harvest tweets efficiently, because only a few of the tweets with coordinates inside the bounding box contain a keyword.
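The location-then-keyword check described above can be sketched in Python as follows. This is a minimal illustration, not the project's harvester: the bounding box values and function names are hypothetical, and only the keyword list and the GeoJSON layout of the tweet `coordinates` field (longitude first, then latitude) come from the report and the Twitter tweet format.

```python
# Keywords quoted in the report.
KEYWORDS = ("worldcup", "brazil2014", "soccer")

# Hypothetical bounding box (west, south, east, north), roughly around Melbourne.
MELBOURNE_BOX = (144.5, -38.5, 145.5, -37.5)


def in_bounding_box(tweet, box):
    """True if the tweet carries point coordinates that fall inside the box."""
    point = tweet.get("coordinates")
    if not point:
        return False  # most tweets carry no coordinates at all
    lon, lat = point["coordinates"]  # GeoJSON order: longitude, latitude
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def mentions_keyword(tweet, keywords=KEYWORDS):
    """True if the tweet text contains any keyword, ignoring case."""
    text = tweet.get("text", "").lower()
    return any(keyword in text for keyword in keywords)


def is_relevant(tweet, box):
    """Both conditions must hold, which is why so few tweets pass."""
    return in_bounding_box(tweet, box) and mentions_keyword(tweet)
```

Because a tweet must satisfy both tests, the yield is bounded by the small share of tweets that are geotagged, which matches the inefficiency noted above.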
On the other hand, the Streaming API can track a number of keywords, after which tweets can be selected by the user location property rather than by coordinates. This approach was used because it harvests more tweets. When relevant tweets are recognized, the program establishes a connection with CouchDB and creates the databases for the five major cities, or opens them if they already exist. Before a relevant tweet is saved into CouchDB, the unique id of the tweet is inserted into the tweet JSON object as the document id. Duplicates are thereby avoided in the database, since CouchDB will not store two documents with the same id. The flow chart of the streaming harvester is shown in figure 3.

Figure 3. Flow chart of streaming harvester

b. Search API Harvester

Because the Brazil World Cup had finished, the Search API was a more effective way to collect old tweets than the Streaming API. After the Streaming API harvester had harvested a number of tweets, a map-reduce view was used in each database to list the user id of each tweet, together with the ids of the users replied to and the users mentioned. This expands the range of relevant users, which has a positive impact on subsequent processing. “GET statuses/user_timeline” is used to harvest previous tweets in the Search API harvester, recording the “max_id” of every timeline page to page through the timeline efficiently and avoid reading a tweet more than once. In addition, each harvested tweet is checked as to whether it contains a keyword and whether its user location contains the name of one of the five major cities before it is inserted into that city’s database.

C. Database Design

CouchDB is deployed as the database server in this project. CouchDB is a document-oriented database. Since harvested tweets and CouchDB documents are both JSON objects, CouchDB is suitable for storing tweets.
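The use of the tweet id as the document id can be sketched as follows. This Python example is an illustration under stated assumptions rather than the project's code: `InMemoryStore` is a hypothetical stand-in that imitates only one CouchDB behaviour (refusing to create a second document with an existing `_id`, which CouchDB reports as an HTTP 409 conflict), and the helper names are invented. The `id_str` field is the string form of the tweet id in Twitter's tweet JSON.

```python
class DuplicateDocument(Exception):
    """Raised when a document with the same _id already exists."""


class InMemoryStore:
    """Hypothetical stand-in for one CouchDB database."""

    def __init__(self):
        self._docs = {}

    def save(self, doc):
        if doc["_id"] in self._docs:
            raise DuplicateDocument(doc["_id"])
        self._docs[doc["_id"]] = doc

    def __len__(self):
        return len(self._docs)


def to_document(tweet):
    """Copy the tweet and use its unique id as the document id."""
    doc = dict(tweet)
    doc["_id"] = tweet["id_str"]
    return doc


def store_tweets(store, tweets):
    """Save each tweet once; return how many new documents were written."""
    saved = 0
    for tweet in tweets:
        try:
            store.save(to_document(tweet))
            saved += 1
        except DuplicateDocument:
            pass  # already harvested, possibly by another harvester
    return saved
```

Letting the database reject the duplicate keeps the harvesters stateless: neither the streaming nor the search harvester needs to remember which tweets it has already seen.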
Compared with relational databases, CouchDB makes reading and writing data simpler, but it does not enforce the data consistency that a relational database ensures.

The volume attached to the “Database” instance is used to store tweets and view index files. In this project, several views were created to establish the tweet counts (positive, negative, neutral and total), the tweets per minute for each match from kick-off to 3 hours later, the hashtags used during the World Cup, the tweet counts by language and user sentiment, the users mentioned in tweets, and the tweet coordinates.

D. Web interface

The web interface uses jQuery UI to build the user interface and jQuery AJAX to communicate with CouchDB. The actual event data, the data of tweets during matches and the predictions are pre-processed, and the results are stored in the “web” database of CouchDB. The pre-processing is needed because the map-reduce output occasionally has no tweets in some time slots; adding “0” for those minutes allows the web interface to read the data without extra processing.

E. Data Analysis

a. Sentiment analysis

In order to estimate the sentiment of each tweet, the Stanford CoreNLP models were used in this project. Because the CoreNLP sentiment model only works on sentences and a tweet may consist of several sentences, it is necessary to split a tweet into sentences. CoreNLP can then predict a sentiment score for each sentence. There are several ways to compute the tweet-level score by applying functions to the sentence-level scores. One approach averages the sentiment scores of all the sentences in the tweet and uses the average as the final sentiment score for the tweet. Another approach uses the sentiment score of the longest sentence as the tweet sentiment score. This project uses the latter, because the main idea of a tweet can be expressed by its longest sentence.
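The two aggregation approaches can be compared with a short Python sketch. It assumes the sentences have already been split and scored (by CoreNLP in this project) and are passed in as (sentence, score) pairs; the function names are hypothetical, sentence length is measured here in whitespace-separated tokens, and the scores are assumed to follow CoreNLP's five sentiment classes from 0 (very negative) to 4 (very positive).

```python
def average_score(scored_sentences):
    """First approach: mean of the sentence-level scores."""
    return sum(score for _, score in scored_sentences) / len(scored_sentences)


def longest_sentence_score(scored_sentences):
    """Second approach (the one the report adopts): score of the longest sentence.

    Length is counted in whitespace-separated tokens, which is an assumption;
    on a tie the earliest sentence wins.
    """
    sentence, score = max(scored_sentences, key=lambda pair: len(pair[0].split()))
    return score


# Hypothetical tweet, already split into sentences and scored.
example = [
    ("What a goal!", 4),
    ("The referee was terrible and the defending was even worse tonight.", 1),
]
```

For `example`, the average is 2.5, while the longest-sentence rule returns 1, so the two approaches can place the same tweet on opposite sides of neutral.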
After the sentiment score has been obtained, the tweet JSON object is updated with a new property named “sentiment_analysis”, which contains the sentiment score of that tweet.

b. Prediction and actual event

According to the FIFA official website, all the actual event information is stored as a JSON document, which includes 64 matches, each between two teams. Each match has a first half, a second half and extra time where it was played, together with the times of the actual events (such as substitutions, yellow cards, goals scored etc.) that happened in the match. Furthermore, all of the actual events are shown on line graphs as a timeline of the match.

Through the web interface, it can be observed that the number of tweets increases dramatically when a goal is scored. It is therefore possible to predict goals (including goals scored, penalties scored and own goals) from the trend in tweet counts. In total there were 171 goals during the World Cup (including penalties scored and own goals), and more than 90% of them happened before a spike (an increase in the tweet count per minute caused by a goal event) in the five cities’ data. Identifying the specific spike is therefore very important for predicting actual events.

Based on the previous description, the following attributes were set during the data pre-processing:
- the increase rates of a spike (the high point of the tweet count compared with the low point, and the high point compared with the average tweet count);
- whether the low point of the tweet count is less than the average tweet count;
- whether the low point of the tweet count is twice as large as the average tweet count;
- whether the high point of the tweet count is more than the average tweet count;
- the minutes between the high point and the low point;
- the number of goals before a spike.

However, it was observed that it is better to use data only from matches where the tweet count is more than 1500.
Otherwise, increases in tweets from a low to a high point (or vice versa) will be inaccurate, e.g. due to random noise in the Twitter data. In addition, the attribute “Interval minute” (the number of goals before a spike) is used to predict the number of goals before a spike. After data pre-processing, goals are predicted in Weka using the J48 decision tree on each city’s data, and each city’s prediction is stored as a JSON document in the “web” database of CouchDB.

III. SYSTEM IMPLEMENTATION

A. The environment of implementation

For this system, the environment of implementation is shown in table 1.

Environment                    Description
Cloud                          NeCTAR
Programming language           Java 1.7, JavaScript
Database                       CouchDB
Data analytic technique        Google Map JavaScript API, HighCharts API, Stanford CoreNLP API, Weka
Twitter harvester technique    Twitter Streaming API, Twitter Search API

Table 1. The system implementation environment

B. Twitter Harvester

Both the Search API harvester and the Streaming API harvester use the twitter4j library for harvesting relevant tweets. The lightcouch library is used to store those tweets (JSON documents) in CouchDB. With the lightcouch library on Windows, the text file encoding setting of Eclipse should be set to “UTF-8”; otherwise the program fails on Windows with unreadable characters, because the default Eclipse setting there was GBK.

It is necessary to use Application-only Authentication for the Twitter Search API, since it allows 300 “GET statuses/user_timeline” requests (the main request made through the Search API in this project) in each 15 minute interval, while only 180 requests per 15 minute interval can be made without Application-only Authentication. Furthermore, the request “GET application/rate_limit_status” is called regularly so that the program avoids exceeding the rate limit.

IV. DATA ANALYSIS RESULTS

The results of the data analysis are demonstrated on the website: http://115.146.93.143:8080/TwitterAnalysis/index.html.

A. Total tweets harvested

The number of harvested tweets is shown in table 2. This is also shown on the “Overview” page of the website. The number of harvested tweets in Australia should be the sum of the tweet counts of the five major cities, but the two are not equal because some tweets are repeated in different databases, which is caused by users who changed their user location during harvesting.

                          Adelaide  Brisbane  Melbourne  Perth   Sydney   Australia
Total count               422243    741138    1332894    474869  1487227  4380302
Count during World Cup    68910     122003    256650     75288   297578   805203

Table 2. The number of tweets harvested from the five major cities and Australia

B. Sentiment analysis in Australia and five major cities

            Adelaide  Brisbane  Melbourne  Perth   Sydney  Australia
Positive    46.6%     47.2%     48.1%      46.3%   46.2%   46.9%
Negative    36.9%     37.7%     35.6%      38%     38%     37.1%
Neutral     16.5%     15.1%     16.3%      15.7%   15.9%   16%

Table 3. The percentage of sentiment analysis

According to table 3, there are more positive tweets than negative ones in the five major cities, which may indicate that nearly half of the tweets express satisfaction with the Brazil World Cup. In terms of the sentiment analysis for each match, the match between Colombia and Greece stands out: in Melbourne it has more negative tweets (42.2%) than positive tweets (39.5%). By contrast, in the match between Belgium and USA, positive tweets (58.7%) are more than twice as numerous as negative tweets (24.7%).

C. Actual events and goal prediction

Figure 4. The tendency of the number of tweets per minute during the match Brazil vs Germany

The results of the goal prediction and the changing number of tweets during the game are clearly demonstrated here. For the Australia-wide data, the predictor identifies around 69% of the spikes that contain a goal (or goals).
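The spike attributes listed in the data pre-processing description can be sketched in Python as follows. This is a hedged illustration only: the report does not say how the low and high points of a spike are located or how the increase rates are calculated, so the sketch takes the two positions as inputs and uses simple ratios. All function and key names are hypothetical, and the goal-count attribute is left out because the report does not define the window it is counted over. The 1500-tweet threshold is the one stated in the report.

```python
# Matches with no more than this many tweets were found too noisy to use.
MIN_MATCH_TWEETS = 1500


def has_enough_tweets(series):
    """True if the match has more tweets than the threshold from the report."""
    return sum(series) > MIN_MATCH_TWEETS


def spike_attributes(series, low_idx, high_idx):
    """Describe one spike in a tweets-per-minute series.

    low_idx and high_idx are the minutes of the low and high points; how
    they are chosen is outside this sketch.
    """
    average = sum(series) / len(series)
    low, high = series[low_idx], series[high_idx]
    return {
        # Increase rates, expressed here as plain ratios (an assumption).
        "rise_over_low": high / low if low else float("inf"),
        "rise_over_average": high / average if average else float("inf"),
        "low_below_average": low < average,
        "low_twice_average": low >= 2 * average,
        "high_above_average": high > average,
        "minutes_low_to_high": high_idx - low_idx,
    }
```

A dictionary of this shape maps directly onto one row of the attribute table that a decision tree learner such as Weka's J48 would consume.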
Using the Adelaide data alone, however, only approximately 29% of the spikes are identified. In addition, the percentages of correctly identified spikes are around 34%, 50%, 34% and 51% in Brisbane, Melbourne, Perth and Sydney respectively. In short, more data gives more accurate results.

Figure 4 shows the detailed analysis of the match between Brazil and Germany with sentiment analysis. The x axis shows the timeline of the match from kick-off to 3 hours later, and the actual soccer events are displayed along it. In addition, the explanation of each symbol is shown above the line graph. After the “predict goals” button is selected, the result of the prediction is shown on the blue line, which depicts the tweet count per minute. In this line graph, all of the actual goals that happened before a spike are predicted, which indicates that actual soccer events are positively correlated with significant increases in the number of tweets during matches.

D. Sentiment geographical distribution analysis

Figure 5 shows the sentiment distribution of tweets with coordinates across Australia. The left map shows the distribution of positive tweets and the right one the distribution of negative tweets. In addition, five markers are placed on the five major cities.

Figure 5. Sentiment geographical distribution in Australia

The distribution of positive tweets is similar to the distribution of negative tweets. However, there are still some differences in smaller areas. As figure 6 shows, there are more positive tweets than negative tweets around Southgate Centre in Melbourne.

Figure 6. Sentiment geographical distribution in Melbourne

E. Top 20 Hashtags

Table 4 lists the most popular hashtags in the Australian tweets. As can be seen, Australians paid particular attention to their own soccer team, because “#GoSocceroos” and “#Socceroos” occur in large numbers of tweets.

Rank  Hashtag          Count
1     #WorldCup        82419
2     #auspol          70821
3     #WorldCup2014    50793
4     #GoSocceroos     34798
5     #SBSWorldCup     25034
6     #worldcup        16095
7     #GER             13807
8     #CHIAUS          10260
9     #AUSNED          9760
10    #BRA             9180
11    #ARG             9041
12    #NED             8102
13    #Socceroos       8055
14    #AUS             7979
15    #BRAGER          7414
16    #WorldCupFinal   6542
17    #GERARG          5770
18    #Melbourne       5689
19    #FifaWorldCup    4958
20    #Brazil2014      4833

Table 4. Top 20 hashtags in tweets

F. Sentiment tendency during World Cup

This scenario shows how the sentiment changed per day during the World Cup for one team (only Australia and Germany were implemented). It uses map-reduce to create a view of the tweets that include the team’s Twitter screen name, the country name and/or the abbreviation of the country name. Figure 8 shows that the sentiment changes significantly when there is a match on that day. Furthermore, Australians expressed more positive than negative views about the German World Cup team: there were many more positive tweets than negative tweets on Jul 8th, the day of the Brazil vs Germany match.

Figure 8. Germany team sentiment tendency during World Cup 2014

G. Language analysis of World Cup

This scenario is based on the “lang” property of the tweet JSON document. As can be seen from Figure 9, the Australian tweets are written in 46 languages and most tweets are in English. In addition, there is a sentiment analysis of the tweets in each language. According to the analysis, around 88% of Japanese tweets are negative and, apart from English, German is the only language with more positive tweets (44.5%) than negative tweets. This may indicate that Japanese-language tweeters were the most pessimistic about the World Cup.

Figure 9. Tweets languages in Australia

V. CONCLUSION

This paper presented a brief introduction to the sentiment analysis of World Cup tweets in Australia and to algorithms for goal prediction based on the tweets. In addition, the collected data indicates that most Australian tweets are positive, despite Australia’s early exit from the World Cup.
The geographical distributions of positive and negative tweets are broadly similar. The analysis shows that actual sports events have a positive impact on the number of tweets and that most goals happened before a spike; hence it is possible to identify game events from Twitter. We found that Japanese-language tweets were the most emotional (negative) in the World Cup 2014. The system has provided an exploration of all matches and events in the World Cup. We have shown that Twitter is far more than just a 140 character string: it can be used to show national fervor, fan bases across Australia and also the most sports-mad locations of Australia.

REFERENCE

Anderson, JC, Slater, N & Lehnardt, J 2009, CouchDB: the definitive guide, O'Reilly Media, Inc., Sebastopol, Calif.

Belmonte, N (@philogb) 2014, The global conversation about the #WorldCup, weblog post, July 10, Twitter, <https://blog.twitter.com/2014/the-globalconversation-about-the-worldcup>

Bloch, J 2008, Effective Java, Addison Wesley, Upper Saddle River, NJ.

Dakhode, V & Kolekar, S 2011, Development of a Weka Filter for Reconstruction of Incomplete Data Sets using Attribute Relation Analysis, Advances in Computational Sciences & Technology, vol. 4, pp. 227-231.

FIFA's Official Social Platforms 2014, FIFA, <http://www.fifa.com/worldcup/>

Graham, M, Hale, SA & Gaffney, D 2013, Where in the World are You? Geolocation and Language Identification in Twitter.

Green, S, Chuang, J, Heer, J & Manning, CD 2014, Predictive translation memory: a mixed-initiative system for human language translation, Proceedings of the 27th annual ACM symposium on User interface software and technology, pp. 177-187.

Machine Learning Group at the University of Waikato, <http://www.cs.waikato.ac.nz/ml/weka/>

Manning, CD, Surdeanu, M, Bauer, J, Finkel, J, Bethard, SJ & McClosky, D, The Stanford CoreNLP Natural Language Processing Toolkit.

Rogers, S (@smfrogers) 2014, Insights into the #WorldCup conversation on Twitter, weblog post, July 14, Twitter, <https://blog.twitter.com/2014/insights-into-the-worldcup-conversation-ontwitter>

Shrader, W 2012, Google Heat Maps, Cartographic Perspectives, no. 72, pp. 85-89.

Twitter, <https://dev.twitter.com/overview/documentation>

World Cup in JSON, <http://worldcup.sfg.io/matches>