by
Charles Kaioun
MSc in Management and Entrepreneurship, EDHEC 2014
SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL
FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN MANAGEMENT STUDIES
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
JUNE 2015
© 2015 Charles Kaioun. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.
Signature of Author:
MIT Sloan School of Management
May 8, 2015
Certified by:
Sinan Aral
David Austin Professor of Management
Thesis Supervisor
Accepted by:
Michael A. Cusumano
SMR Distinguished Professor of Management
Program Director, M.S. in Management Studies Program
MIT Sloan School of Management
by
Charles Kaioun
Submitted to MIT Sloan School of Management on May 8, 2015 in Partial Fulfillment of the requirements for the Degree of Master of Science in
Management Studies.
ABSTRACT
This study is an initial attempt to investigate the reach of Twitter depending on the original location of tweets. The major objective is to measure how local or global Twitter's reach is and how localized cities compare to big cosmopolitan cities.
Using a custom data scraper, this analysis gathers data on tweets emitted from 12 U.S. cities as well as the associated retweets. Retweets were mapped to a location, and their distance from the point of origin was analyzed to better understand the reach of each city. The analysis offers a description of a method to capture such data, an exploration of potential patterns and an attempt to correlate the reach of big cities with specific factors.
On the basis of the results of this analysis, it appears that despite Twitter being considered a global network empowering people for broader communication, small cities have almost no international reach. Twitter appears to be first and foremost a local network, and only cosmopolitan cities seem to have a reach beyond their local sphere.
Thesis Supervisor: Sinan Aral
Title: David Austin Professor of Management
I would like to thank my parents for helping me and supporting me through my studies and enabling me to come to MIT. A lot of people were involved in this endeavour but it would not have been possible without them. I thank my advisor Sinan Aral for his insights both in and outside of class. He helped me a lot to refine the subject to achieve an interesting angle and offered constant advice on the methodology and ideas used in this paper.
Special thanks to Dina Levy-Lambert, who introduced me to Evan Jones at Twitter, guiding me in the right direction when I needed insights on the Twitter API.
Finally, I would like to offer my thanks to everyone who made my experience at MIT enjoyable and full of new learnings while writing the thesis.
Introduction
Motivation
Approach
Challenges
Related Work
Geolocation
Sentiment
System Architecture
Tweets Recorder
Retweets Updater
Retweets Recorder
Location Data Retriever
Limitations
Analysis
Data overview
Retweets mapping
Origin Influence
Regression on cities
Propagation outside the U.S.
Sentiment effect
Conclusion
Twitter¹ is the 8th² most visited website on the web. It has grown from a simple status-posting service into what is supposedly a globalizing force, enabling access to global information. Twitter is not just used by people to get closer to personalities they like. It also has many other uses, such as helping to spread news, offering a lightweight communication platform, informing on trends and feelings, connecting organizations and users and providing a platform for citizen journalism.
Localisation and geotargeting are increasingly common in Internet services. They are used to enrich the user experience and provide specific services based on a user's location. With the rise of mobile, obtaining this information in real time and using it has become increasingly achievable. Social media are distinct in that distance between individuals is usually considered from a social perspective and not from a pure geolocation perspective. Yet, because Twitter is not only a 'social' medium but also a new form of media and an information propagation tool, it seemed interesting to reflect on the actual proximity of people interacting on Twitter.
Twitter appears to most users more global than other social media sites, yet there is still little literature on this aspect of the network. The global aspect of the network has its limits, however, and my analysis helps to explain why, although big and cosmopolitan cities have a global reach, localized cities have almost no global visibility on Twitter.
Because Twitter data is abundant in terms of content produced while the tools and ways to analyze it are scarce, the approach here was to focus on a specific interaction in the network: retweets and their distance from the original tweet location.
Measuring the reach of a city on Twitter implies that the original location of a tweet might have an impact on its future propagation in the country and beyond. It will certainly not explain the potential of a tweet to spread far and wide, but it will permit an analysis of the difference between large and local cities and outline the limitations of Twitter as a universal global network.
The dataset created for this analysis consists of two main categories of data points. The first category consists of all the tweets sent from 12 cities in the U.S. during a week. The second category consists of all the retweets associated with the tweets recorded.
i. Data Size
1 https://twitter.com/
2 www.alexa.com/siteinfo/twitter.com
The community of users on Twitter produces content at an extremely high rate. This is the reason why analyzing Twitter data implied analyzing a significant amount of data, even if the analysis window is relatively small. It also meant that having a suitable tailored infrastructure was important in order to store and manipulate the data, both in real time and for later analysis. Twitter is flexible in the amount of data it provides and built-in filters are not always effective, which meant that a large percentage of data needed to be filtered out after the collection process.
In addition, Twitter reserves the right to restrain the amount of data it provides during peak times to save its infrastructure processing power. After review of the data, this could have happened during the capture and this must be taken into account for longer capturing windows. Finally, even if Twitter allows access to a great deal of data through its public API, they impose numerous restrictions when accessing the data.
ii. Scarcity of geo-data
Since geolocation data is sensitive and in its early stages, the feature is disabled by default in Twitter's services. Some users have enabled it, either because they want to discover new features or they were prompted by Twitter as they are currently expanding the feature. This is why geo-data is still relatively rare and only represents a subset of the data available on Twitter.
iii. Contents of Tweets
Trying to analyze the content of tweets is a genuine challenge. Not only are tweets written in multiple languages, they also include a lot of metadata and other artifacts that complicate the analysis. Hashtags, replies, URLs and other items must be stripped out for better readability, and yet they might contain useful information on the content or the user's state of mind when publishing a tweet.
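The stripping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions and the function name `split_tweet` are hypothetical and are not the cleaning rules used in this analysis. The sketch keeps the removed items alongside the cleaned text, since they may carry useful information.

```python
import re

# Assumed, simplified patterns; real tweets need more careful handling.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def split_tweet(text):
    """Return (clean_text, artifacts) for one tweet.

    clean_text has URLs, @mentions and #hashtags removed;
    artifacts keeps what was removed so it can still be analyzed.
    """
    artifacts = {
        "urls": URL_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }
    clean = URL_RE.sub(" ", text)
    clean = MENTION_RE.sub(" ", clean)
    clean = HASHTAG_RE.sub(" ", clean)
    # Collapse the whitespace left behind by the removals.
    clean = re.sub(r"\s+", " ", clean).strip()
    return clean, artifacts
```

Keeping the artifacts in a separate structure leaves the choice open between analyzing the cleaned text alone or together with its hashtags.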
iv. Privacy
Dealing with user-identified data always introduces challenges. Having an identified data set of mood and location raised the issue of user privacy. This analysis tried to protect the users' privacy as much as possible, not only because it is a moral obligation, but also because it is enforced by strict rules in the Terms and Conditions³ of the Twitter API. Developers must ensure they use the data with the utmost respect for user privacy and are responsible for using the data in a secured and private environment. To comply with those rules, the present analysis is based on anonymized and aggregated data.
3 Twitter Developers Terms and Conditions, https://dev.twitter.com/overview/terms/agreement
Twitter is an interesting social medium to study, as in just a few years it became an essential communication tool on the Web, with 750 million registered users and more than 270 million active users per month.⁴ The network has been widely analyzed and studied thanks to the data it provides through its multiple APIs.⁵ A 2010 paper by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon entitled What is Twitter, a Social Network or a News Media?⁶ describes the different aspects of the network, from the degree of separation of users to the role retweets play on the network. The authors emphasize that retweets are an effective way to propagate information on Twitter and that the number of followers of the original poster has little effect on a tweet's capacity to be widely retweeted. Although this analysis offers a clearer view of the propagation of a tweet on the network, the actual reach in space is not tackled. The paper explains that since the location is provided by the user without any format constraints, it is very unreliable. This was true in 2010, but the rise of mobile provides a new angle on this issue. My analysis tries to use new ways of tracking user data and to push the analysis of retweet propagation beyond the network aspect.
Many web services are building features enabled by the use of mobile phones to utilize geolocation and geotargeting. Twitter offers geotargeting for many of its advertising products (figure 1) and intends to fully utilize this new set of data.
[Figure 1: comparison table of the Promoted Tweet and Promoted Account products, with rows for targeting (keywords, followers of specific accounts, gender, platform/OS), Twitter pricing (cost per engagement, cost per follow), engagement and creative; cell contents not fully recoverable]
Twitter advertising products⁷ (Fig. 1)
4 Twitter Data: What Do They Represent?, Axel Bruns, Stefan Stieglitz, http://snurb.info/files/2014/Twitter%20Data.pdf
5 https://dev.twitter.com/rest/public
6 What is Twitter, a Social Network or a News Media?, Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon, http://www.eecs.wsu.edu/~assefaw/CptS580-06/papers/2010-www-twitter.pdf
7 Quantcast Twitter Guide 2014, http://info.quantcast.com/rs/quantcast/images/Quantcast%20Twitte-%20Guide-2014-final.pdf, https://business.twitter.com/solutions/geo-targeting
On the other hand, the gathering of geolocation data on Twitter users is not prominently advertised by the Twitter API. A potentially complementary paper by Ryan Compton, David Jurgens and David Allen, entitled Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization,⁸ caught my attention. The authors explain how a user's friends and followers on Twitter can be used to infer the user's location within a 6.38 km radius. They use publicly available data, which provides a location for a small number of tweets, to infer the probable location of users. With the appropriate time and resources, this would have been a great addition to the method used in the present analysis to find the location of retweets.
Sentiment analysis is increasingly used on Twitter. It makes it possible to get user feedback on specific topics or at a certain moment with a large sample size and little experimental bias. More and more tools try to provide comprehensive and visual ways to present sentiment data in social media. Agave⁹ is one example: it provides insights and attempts to visualize sentiment on timelines to infer changes of mood in the Twitter population.
A few examples show that sentiment analysis on Twitter holds great promise in this area. It has been used, for example, to analyze the public reaction to presidential debates and to reproduce the results of polls before presidential elections. In From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series,¹⁰ the authors present their text analysis methodology and show that Twitter is surprisingly precise when looking at presidential election preferences.
8 Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization, Ryan Compton, David Jurgens, David Allen, http://arxiv.org/pdf/1404.7152v2.pdf
9 Collaborative Visual Analysis of Sentiment in Twitter Events, Michael Brooks, John J. Robinson and Megan K. Torkildson, https://students.washington.edu/-jbrooks/wp-content/uploads/2014/10/Brooks-et-al.-CDVE2014-Collaborative-visual-analysis.pdf
10 From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series, Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, Noah A. Smith, http://www.aaai.org/ocs/index.php/ICWSM/ICWSM
536/1842/
The recording architecture consisted of two machines for data processing and storage. One server ran PHP scripts (capturing and processing data) and a MySQL database (storing data). A second server monitored the first one to make sure the system ran without interruption and to provide real-time optimization of the scripts and databases. This setup proved particularly effective at ensuring reliability when a script hung or when maintenance performed by one of the server providers made a script crash.
Some of the code has been optimized during the data collection to answer specific needs and to optimize resources and quality of the data. The description below is the final architecture used to gather the data in my analysis.
The first part of the system was responsible for retrieving, in real time, all the tweets emitted from the 12 U.S. cities. This module queried Twitter's "streaming" API. This API has few limitations: it only requires a filtering parameter (location or keyword) and an infrastructure fast enough to store the data in real time. The API returns a JSON object with all the available information for a tweet.
Since it was unclear which data would be useful for further analysis and the server used had reasonable power, the choice to process some data in real-time and store the rest for potential later use was made. This allowed for real-time treatment of some data (because PHP is particularly good at treating JSON) and storage of the rest for later need. The next two parts of the system are dedicated to the collection of retweets.
This second part of the system was implemented to work around some limitations of the Twitter API. It is a script that loops through all the tweets already recorded and performs a set of pre-defined actions. For 100 tweets at a time, it verifies whether each tweet is still present or has been deleted, checks whether the tweet has been retweeted, and flags it for potential retweet retrieval by the third module if needed. Since the retweet recorder module could only retrieve retweets at a slow pace (due to a Twitter API limitation explained below), knowing which tweets to treat first was the only way to capture data as close to real time as possible.
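The updater loop described above can be illustrated with a short Python sketch. It is not the original code, which was written in PHP and is not reproduced here. The callable `lookup_batch` is a hypothetical stand-in for the API call that returns the current state of up to 100 tweets, and the record fields `retweet_count`, `deleted` and `needs_fetch` are assumed names.

```python
def chunks(ids, size=100):
    """Yield successive batches of at most `size` tweet ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def update_flags(store, lookup_batch, size=100):
    """Refresh every stored tweet and flag those with new retweets.

    store: dict mapping tweet id -> record with the assumed keys
           'retweet_count', 'deleted' and 'needs_fetch'.
    lookup_batch: hypothetical callable; takes a list of ids and returns
           a dict id -> current retweet count for tweets that still exist.
    """
    for batch in chunks(list(store), size):
        current = lookup_batch(batch)
        for tweet_id in batch:
            record = store[tweet_id]
            if tweet_id not in current:
                # The tweet is no longer served: treat it as deleted.
                record["deleted"] = True
                record["needs_fetch"] = False
                continue
            if current[tweet_id] > record["retweet_count"]:
                # New retweets appeared since the last pass.
                record["needs_fetch"] = True
            record["retweet_count"] = current[tweet_id]
    return store
```

Passing the lookup as a parameter keeps the sketch independent of any particular HTTP client and makes the loop easy to exercise with a fake response.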
This third module recorded the retweets of the tweets already captured. This was done in real time for one specific reason: the Twitter API only serves the last 100 retweets of a given tweet. The more often the system collected retweets, the lower the chance of missing some. Since Twitter enforces quotas on the number of API calls per 15 minutes, and since this API call only fetches data for one tweet at a time, having a way to prioritize treatment was essential; hence the second module described previously.
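One possible way to prioritize under such a quota is sketched here in Python. The rule is an assumption made for illustration: tweets with the most uncollected retweets are fetched first, because they are the closest to overflowing the 100-retweet window. The prioritization actually used by the system is not detailed in this thesis, and the `pending` bookkeeping and the quota value are hypothetical.

```python
import heapq


def pick_for_window(pending, quota):
    """Choose which tweets to fetch in one rate-limit window.

    pending: dict tweet id -> number of retweets not yet collected
             (assumed bookkeeping kept by the updater).
    quota:   number of API calls allowed in the window (assumed value).
    Returns tweet ids ordered from most to least at risk of data loss.
    """
    at_risk = [(count, tweet_id) for tweet_id, count in pending.items() if count > 0]
    return [tweet_id for count, tweet_id in heapq.nlargest(quota, at_risk)]
```

For example, with a quota of two calls, a tweet with 80 uncollected retweets and one with 40 would be fetched before a tweet with only 5.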
The Twitter API used to retrieve the retweets was supposed to return the geo-data of a retweet. The documentation and sample data show this information, yet the API actually always returned null data. The last part of the system was designed to circumvent this shortcoming of the API. This module creates a proxy for the geo-data: it extracts the user ID from a retweet, fetches the most recent tweets sent by this user and scrapes them for location data based on frequency and timestamp.
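A minimal Python sketch of such a proxy follows. The exact rule combining frequency and timestamp is not specified above, so the rule used here is an assumption: the most frequent location among the user's recent geotagged tweets wins, and ties go to the most recently seen location. The input format and the rounding precision are also hypothetical.

```python
from collections import Counter


def infer_location(recent_tweets, precision=2):
    """Guess a user's location from their recent tweets.

    recent_tweets: iterable of (timestamp, lat, lon); lat and lon are
                   None when the tweet carries no geo-data.
    precision: decimals kept when grouping nearby points (assumed value).
    Returns (lat, lon), or None if no recent tweet is geotagged.
    """
    geotagged = [
        (ts, round(lat, precision), round(lon, precision))
        for ts, lat, lon in recent_tweets
        if lat is not None and lon is not None
    ]
    if not geotagged:
        return None
    counts = Counter((lat, lon) for _, lat, lon in geotagged)
    latest = {}
    for ts, lat, lon in geotagged:
        point = (lat, lon)
        latest[point] = max(ts, latest.get(point, ts))
    # Most frequent point first; the latest timestamp breaks ties.
    return max(counts, key=lambda point: (counts[point], latest[point]))
```

Returning None for users without any geotagged tweet mirrors the fact that only a subset of retweets could be mapped to a location.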
Given the way the system was built, the computing resources and the heavy limitations of the Twitter API, two main trade-offs are present in the dataset.
A first trade-off had to be made for tweets with more than 100 retweets. Even with real-time collection, given the strain on the system and the extreme virality of some tweets, a lot of retweets were missed. Only a very small number of tweets (660, 0.0004% of all tweets, 0.003% of tweets with retweets) fall in this category, but it would be interesting to obtain data for them to complete the analysis. Overall, the capture rate was 63.02%. For tweets with fewer than 100 retweets, the capture rate is 84.76%. For tweets with more than 100 retweets, the capture rate is less than 1%. This means that the most viral tweets are largely absent from this analysis, and it would be interesting to have access to Twitter's archive services (GNIP) to see how the analysis would change with broader data.
The second trade-off applies to location data. Even with a full week of tweets, some small cities have only a handful of retweets with location data. This still allows preliminary conclusions to be drawn, but a longer time frame would allow deeper verification.
[Figure 2: map of the United States showing the tweets recorded, with one marker color per city: Boston, Charleston, Chicago, Fairfield, Kansas City, Las Vegas, Los Angeles, New York, Odessa, San Francisco, Springfield, Washington]
Mapping of Tweets Recorded (Fig. 2)
The following graphs (figures 3, 4 and 5) give more context on the distribution of the data and how each city produces content on the network.
[Figure 3: bar chart of the number of tweets per city; cities listed as Odessa, Springfield, San Francisco, Fairfield, Charleston, Kansas City, Boston, Las Vegas, Washington, Chicago, New York, Los Angeles; horizontal axis from 0K to 460K tweets]
Tweets / City (Fig. 3)
[Figure 4: bar chart for the same cities; horizontal axis from 0.01 to 0.21]
Tweets / City adjusted for Population Size (Fig. 4)
[Figure 5: chart with one series per city; title not recoverable]
(Fig. 5)
It appears that Twitter's activity is cyclical, with 3 AM being the least active period. The two discrepancies in the data (March 18th at 9 PM and March 24th at 5 PM) can be explained by the fact that during those peak hours the Twitter API does not deliver all the tweets, in order to relieve Twitter's infrastructure. Hence the seemingly low traffic at those times.
The 1,406,331 tweets recorded generated a total of 816,903 retweets captured during the two months following the original capture.
[Figures 6, 7 and 8: bar charts per city; values not recoverable]
Retweets / City (Fig. 6)
Has at least 1 RT / City (Fig. 7)
RT / Tweet / City (Fig. 8)
There is definitely a hierarchy that is not based on city size alone. Los Angeles, New York and Washington are three cities where tweets are especially well broadcast and retweeted.
In order to map the retweets, two data points were needed: the location of the original tweet and the location of the retweet. Since the tweets were captured based on location, all of them contained location data, often as precise coordinates (longitude/latitude). The rest were tied to a specific place (a building, a park, ...) described by box coordinates. For those, the middle of the box was taken as the origin point.
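The choice of origin point can be sketched as follows in Python. The field names (`coordinates`, `place`, `bounding_box`) and the longitude-first ordering follow the tweet JSON layout as recalled from the Twitter API; they should be treated as assumptions and checked against the API documentation. The function name is hypothetical.

```python
def origin_point(tweet):
    """Return (lat, lon) for a captured tweet.

    Uses the exact point when the tweet has one; otherwise falls back
    to the middle of the place's bounding box.
    """
    point = tweet.get("coordinates")
    if point:
        lon, lat = point["coordinates"]  # assumed order: longitude first
        return lat, lon
    # Assumed layout: a list holding one ring of [lon, lat] corners.
    ring = tweet["place"]["bounding_box"]["coordinates"][0]
    lons = [corner[0] for corner in ring]
    lats = [corner[1] for corner in ring]
    return (min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2
```

Taking the midpoint of the extreme corners is adequate for small boxes such as a building or a park; for very large boxes the midpoint becomes a coarse approximation.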
For mapping the retweets, as explained in the system architecture, only a subset of them could be mapped to a location using tweets sent by the same user. Out of the 514,820 retweets captured, 32,551 could be linked to a location. This shows the scarcity of geo-data on Twitter, as the feature is still being promoted to users. Yet the ratio of retweets with location to total retweets (table 1) is almost the same for every city, so the bias introduced by having only a small percentage of retweets mapped should be minimal.
City             Retweets with location / retweets
Springfield      0.063
Odessa           0.064
Fairfield        0.052
San Francisco    0.085
Charleston       0.076
Kansas City      0.068
Boston           0.075
Las Vegas        0.056
Chicago          0.055
Washington       0.062
New York         0.067
Los Angeles      0.063
(Table 1)
Before looking at the effect of the origin, figures 9 and 10 show an overview of the dispersion of the data.
[Figure 9: map of the retweet locations]
Map of Retweets (Fig. 9)
[Figure 10: one map per city of origin: Boston, Charleston, Chicago, Fairfield, Kansas City, Las Vegas, Los Angeles, New York, Odessa, San Francisco, Springfield, Washington]
Maps of the Retweets per City of origin (Fig. 10)
Looking beyond the number of retweets, there are clear patterns in the propagation of tweets depending on the city of origin. Los Angeles, New York and Washington have a more international profile, with retweets reaching the two U.S. coasts, Europe and Asia. Kansas City and Charleston seem to have a more regional reach, with retweets propagating in the U.S. around their point of origin. Odessa and Springfield seem to be much more local, with almost no retweets going beyond the state line.
Looking at the earlier maps (figures 9 and 10), there seem to be distinctly different patterns of propagation. One way to try to understand the difference in reach between the cities is to look at the average distance between the origin and the retweet location (figure 11). The averages confirm what the maps show, but reach does not simply follow the size of the city. (Chicago is as local as Fairfield, for example.)
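The average distance can be computed as sketched here in Python. The sketch assumes great-circle (haversine) distances on a sphere of radius 6371 km; the distance formula actually used is not stated in this thesis, so this is an illustration rather than the original computation. Function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def average_distance_km(pairs):
    """Mean origin-to-retweet distance.

    pairs: list of ((lat, lon) of the tweet, (lat, lon) of the retweet).
    """
    distances = [haversine_km(*origin, *retweet) for origin, retweet in pairs]
    return sum(distances) / len(distances)
```

As a sanity check, a quarter of the equator (from longitude 0 to longitude 90 at latitude 0) comes out at about 10,007.5 km with this radius.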
City             AVG (km)
Odessa           506
Kansas City      883
Charleston       1,084
Springfield      1,163
Chicago          1,608
Fairfield        1,658
Boston           1,894
Washington       1,994
Las Vegas        2,222
New York         2,488
Los Angeles      2,570
San Francisco    2,759
Retweets Average distance from origin (Fig. 11)
In order to understand to what extent a city has a global or local reach, I looked at the percentage of retweets per city depending on the distance from origin. The following graph (figure 12) shows the percentage of retweets that fall within different radii (everything above 10,000 km is in the last interval).
[Figure 12: one series per city showing the percentage of retweets within each radius]
Percentage of Retweets per kilometers radius (Fig. 12)
The first thing to notice is that, in any case, 35% or more of the retweets are under 250 km from their origin, and 45% are under 500 km, suggesting that despite being advertised as a global media tool, Twitter is substantially used within a local area.
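The percentages discussed above are cumulative shares of retweets within a given radius of their origin. A minimal Python sketch of that computation is shown here; the function name and the default radii are chosen for illustration only.

```python
def share_within(distances_km, radii=(250, 500, 1000)):
    """Percentage of retweets strictly closer than each radius.

    distances_km: origin-to-retweet distances for one city, in km.
    Returns a dict radius -> percentage (0 to 100).
    """
    total = len(distances_km)
    return {
        radius: 100.0 * sum(d < radius for d in distances_km) / total
        for radius in radii
    }
```

With distances of 10, 200, 300, 600 and 5,000 km, this gives 40% under 250 km, 60% under 500 km and 80% under 1,000 km.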
[Table 2: percentage of retweets within each radius, from < 250 km to < 10,250 km in 250 km steps, one column per city; most cell values are not recoverable. The < 250 km row reads: 49 55 45 56 56 35 49 39 59 45 44 43 (column order not recoverable)]
(Table 2)
Table 2 presents those percentages; the colors indicate the thresholds crossed (70%, 80%, 90%). Cities with little [...] the most global. Finally, the more light blue and medium-dark blue, the more regional.
A city like Odessa proves to be extremely local in the reach of its tweets, whereas tweets from cities like San Francisco or Los Angeles propagate the furthest. A partial explanation of this difference compared to New York could be that those two cities are further away from Europe. It appears that, although Twitter is considered a global medium, localized cities have trouble reaching the international stage on Twitter.
Tweets further than 10,000 km are all in the last bucket
Table 3 shows the results of the regression of the distance in km (dependent variable) between tweets and retweets against each city (independent variables). Most of the results are significant and show the clear difference in propagation (Boston is the reference in the following table).
City             Coef.      Std Err.   t       P>|t|   [95% Conf. Interval]
Charleston       -810.77    204.54     -3.96   0.000   -1211.669    -409.858
Chicago          -286.67    100.14     -2.86   0.004   -482.9751    -90.40272
Fairfield        -236.80    268.62     -0.88   0.378   -763.3101    289.7138
Kansas City      -1011.46   159.16     -6.35   0.000   -1323.427    -699.5007
Las Vegas        327.31     113.32     2.89    0.004   105.2067     549.4217
Los Angeles      675.58     87.29      7.74    0.000   504.4957     846.6595
New York         593.76     88.81      6.69    0.000   419.6971     767.8154
Odessa           -1388.06   383.72     -3.62   0.000   -2140.168    -635.9551
San Francisco    865.07     222.82     3.88    0.000   428.3391     1301.793
Springfield      -730.93    438.07     -1.67   0.095   -1589.552    127.7014
Washington       99.22      96.77      1.03    0.305   -90.4606     288.9043
Cons. (Boston)   1894.36    80.99      23.39   0.000   1735.607     2053.106
(Table 3)
It appears that Las Vegas, Los Angeles, New York and San Francisco (and Washington, though its P-value is too high to draw a strong conclusion) have a more global reach than Boston. On the contrary, Charleston, Chicago, Kansas City and Odessa (and Springfield and Fairfield, though their P-values are too high to draw a strong conclusion) are much more local than Boston. This analysis emphasizes the large differences between cities seen earlier.
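Because the only regressors are city indicators with Boston as the reference, the constant is the mean retweet distance for Boston and each coefficient is the difference between a city's mean distance and Boston's. The short Python check below uses the coefficients reported in table 3 and the averages reported in figure 11; the variable and function names are illustrative.

```python
# Constant and coefficients as reported in table 3 (km).
CONS_BOSTON = 1894.36
COEF = {
    "Charleston": -810.77, "Chicago": -286.67, "Fairfield": -236.80,
    "Kansas City": -1011.46, "Las Vegas": 327.31, "Los Angeles": 675.58,
    "New York": 593.76, "Odessa": -1388.06, "San Francisco": 865.07,
    "Springfield": -730.93, "Washington": 99.22,
}

# Average distances as reported in figure 11 (km).
FIG11_AVG = {
    "Odessa": 506, "Kansas City": 883, "Charleston": 1084,
    "Springfield": 1163, "Chicago": 1608, "Fairfield": 1658,
    "Boston": 1894, "Washington": 1994, "Las Vegas": 2222,
    "New York": 2488, "Los Angeles": 2570, "San Francisco": 2759,
}


def implied_means():
    """Mean distance per city implied by the regression, rounded to km."""
    means = {"Boston": round(CONS_BOSTON)}
    for city, coef in COEF.items():
        means[city] = round(CONS_BOSTON + coef)
    return means


if __name__ == "__main__":
    for city, mean in sorted(implied_means().items(), key=lambda kv: kv[1]):
        print(f"{city:15s} {mean:5d} km (figure 11: {FIG11_AVG[city]} km)")
```

The implied means match the averages of figure 11 to the kilometer, which confirms that the regression restates the per-city averages while adding standard errors and significance levels.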
In order to understand why that is the case, it seemed worthwhile to use some other independent variables that might explain those differences. In the following table (table 4), the distance is still used as the dependent variable.
Variables               Coef.      Std Err.    t       P>|t|   [95% Conf. Interval]
Population 2013 (k)     .062       .0067       9.32    0.000   .0488599    .0748935
Has a large airport¹¹   1117.10    102.94      10.85   0.000   915.3287    1318.866
Per Capita Income       -.[...]    .0026896    -4.50   0.000   -.[...]     -.[...]
Cons.                   1559.11    154.46      10.09   0.000   1256.365    1861.864
(Table 4)
Table 4 shows that population has a small but measurable effect, suggesting that big cities must offer other traits that increase reach. Having a large airport is apparently a strong contributor to reach. In this case, it is difficult to know whether a city with a large reach goes on to build a large airport or whether building a large airport contributes to reach; there is probably some reverse causality. The last variable is interesting: per capita income appears to play only a small role in influencing reach, which would suggest that the economic situation of a city plays almost no role in the propagation of tweets.
Defining local can be done using pure distance but also at the country scale. The following graph (figure 13) shows the percentage of retweets that are international (outside the U.S.).
[Figure 13: stacked bar chart of the percentage of retweets inside and outside the U.S. per city; cities listed as Odessa, Kansas City, Charleston, Fairfield, Springfield, Chicago, Las Vegas, Boston, San Francisco, Los Angeles, Washington, New York; horizontal axis "% of RTs in the US" from 0% to 100%]
Percentage of Retweets in/out the U.S. (Fig. 13)
The results are not significantly different, as retweets far away are often outside the U.S., but Washington seems to be more international on Twitter than San Francisco. This is mainly explained by the fact that tweets from Washington impact both the regional zone and the rest of the world, given that many U.S. government bodies probably tweet from this location.
11 Defined by the FAA depending on the role of the airport http://www.wikiwand.com/en/List of airports in the United States#/Primar airports
In this part, the idea was to investigate the propagation of tweets depending on their content and whether there was a difference between cities. The process involves analyzing the content of every tweet and trying to determine whether it is positive or negative. Of course, this analysis contains a lot of errors, as semantic analysis is complicated. It was done using R and a package called sentiment¹² that rates strings depending on their content and associates a sentiment (positive, negative, neutral) with each of them.
The following graph (figure 14) describes the difference in the type of tweets retweeted at different distance radii depending on the city. For each city, each bar represents the percentage of retweets linked to a type of content for different radii (10, 50, 100, 1000 km and all).
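The breakdown by radius can be sketched as follows in Python. The analysis itself was done in R; this is an illustrative re-expression that assumes each retweet has already been reduced to a pair (distance in km, sentiment label). The function name, the input format and the use of an inclusive radius are assumptions.

```python
from collections import Counter


def sentiment_shares(retweets, radii=(10, 50, 100, 1000, None)):
    """Share of each sentiment among retweets within each radius.

    retweets: list of (distance_km, sentiment) with sentiment one of
              'positive', 'negative' or 'neutral'.
    radii:    radii in km; None stands for "all retweets".
    Returns a dict radius (or 'all') -> {sentiment: percentage}.
    """
    result = {}
    for radius in radii:
        labels = [s for d, s in retweets if radius is None or d <= radius]
        counts = Counter(labels)
        key = "all" if radius is None else radius
        result[key] = {s: 100.0 * n / len(labels) for s, n in counts.items()}
    return result
```

A radius with no retweets yields an empty dictionary, which avoids a division by zero for small cities with few located retweets.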
[Figure 14: bar charts per city of the percentage of retweets by sentiment for each radius; vertical axis from 10% to 35%]
Repartition of Sentiment in Retweets based on radius (Fig. 14)
The graph shows that Twitter is not used in the same way at the local scale and at the global scale. In San Francisco, local retweets seem to be much more negative than the baseline, whereas in Charleston this effect is reversed. For the other cities it is less clear, but it appears that more local chatter is associated with more negative tweets. This could suggest a specific usage of Twitter for local issues and that positive tweets have an overall greater reach. This also shows that in certain cities there is a specific usage of Twitter with a probable impact on local communities. This is corroborated by the regression (table 5) of the distance (dependent variable) on the sentiment in the original tweet (independent variable).
12 R package sentiment by Timothy Jurka
Variables           Coef.      Std Err.   t       P>|t|   [95% Conf. Interval]
Tweet is positive   447.29     40.48      11.05   0.000   367.9524    526.6254
Cons.               1959.56    33.12      59.17   0.000   1894.647    2024.471
(Table 5)
5. Conclusion
This paper investigated the local aspect of Twitter and the difference in reach between U.S. cities on the network. It identified several differences between small and big cities in the way tweets propagate and achieve a local, regional or international reach. It also emphasized how difficult it is to extract geolocation data and derive insights from it without the proper tools and access. While a lot of features and products are advertised around geotargeting, the reality is that few people actively enable those features and benefit from them. The literature on the way Twitter supposedly allows for global communication is inconclusive, in part because most of the data is proprietary.
This study sought to answer two questions: Do all cities have a global reach on Twitter? And what could explain the difference in reach?
The main findings are organized around three axes.
What data is accessible and how to gather it? Twitter does provide multiple access points, but even with the right infrastructure the data is not truly public. Geolocation data is also still scarce compared to the amount of data on the network.
How to properly visualize the data and verify the assumption that cities have an impact on the reach of tweets? When dealing with geolocalized data, maps are a good way to present preliminary assumptions and explore potential patterns. The next step was to synthesize those assumptions and show that the 12 U.S. cities have very different reach and that only very large cities with a coastal presence have a truly global reach.
How can the content of the tweet be linked to the way different cities use Twitter? Twitter is also a particularly local medium in the sense that its usage changes from city to city and because it is used for specific reasons at a local scale, as shown by the sentiment analysis.
There are evident limitations to this analysis, and deeper data access would allow further analysis of the most viral aspects of the network. Expanding the capture area would also make it possible to explore the continental scale and to see whether the conclusions of the analysis apply outside of the U.S.
However, having a clear view of the potential reach of a given city should enable better consideration of how local information travels across the world. Moreover, geotargeted services are still in early development. Understanding the local aspect of networks such as Twitter and the way people interact with the platform could allow new forms of services and products to emerge alongside the traditional usage of Twitter in global media, for example.
Despite bringing a new way of communicating globally and enabling communication beyond the local area, Twitter is used mainly as a truly local communication tool. Moreover, only big cosmopolitan cities seem to have a truly global reach, whereas small cities struggle to reach beyond their local area of influence.