How local is Twitter?


by

Charles Kaïoun

MSc in Management and Entrepreneurship, EDHEC 2014

SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL

FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE IN MANAGEMENT STUDIES

AT THE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

JUNE 2015


© 2015 Charles Kaïoun. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: Signature redacted

MIT Sloan School of Management

May 8, 2015

Certified by: Signature redacted

Sinan Aral

David Austin Professor of Management

Thesis Supervisor

Accepted by: Signature redacted

Michael A. Cusumano

SMR Distinguished Professor of Management

Program Director, M.S. in Management Studies Program

MIT Sloan School of Management

How local is Twitter?

by

Charles Kaïoun

Submitted to MIT Sloan School of Management on May 8, 2015 in Partial Fulfillment of the requirements for the Degree of Master of Science in Management Studies.

ABSTRACT

This study is an initial attempt to investigate the reach of Twitter depending on the original location of tweets. The major objective is to measure how local or global Twitter's reach is and how localized cities compare to big cosmopolitan cities.

Using a custom data scraper, this analysis collects tweets emitted from 12 U.S. cities along with their associated retweets. Retweets were mapped to a location, and their distance from the point of origin was analyzed to better understand reach by city. The analysis offers a description of a method to capture such data, an exploration of potential patterns and an attempt to correlate the reach of big cities with specific factors.

On the basis of the results of this analysis, it appears that despite Twitter being considered a global network empowering people for broader communication, small cities have almost no international reach. Twitter appears to be first and foremost a local network, and only cosmopolitan cities seem to have a reach beyond their local sphere.

Thesis Supervisor: Sinan Aral

Title: David Austin Professor of Management


Acknowledgments

I would like to thank my parents for helping me and supporting me through my studies and enabling me to come to MIT. A lot of people were involved in this endeavour but it would not have been possible without them. I thank my advisor Sinan Aral for his insights both in and outside of class. He helped me a lot to refine the subject to achieve an interesting angle and offered constant advice on the methodology and ideas used in this paper.

Special thanks to Dina Levy-Lambert, who introduced me to Evan Jones at Twitter, guiding me in the right direction when I needed insights on the Twitter API.

Finally, I would like to offer my thanks to everyone who made my experience at MIT enjoyable and full of new learnings while writing the thesis.


Table of Contents

1. Introduction

a. Motivation

b. Approach

c. Challenges

2. Related Work

a. Twitter

b. Geolocation

c. Sentiment

3. System Architecture

a. Tweets Recorder

b. Retweets Updater

c. Retweets Recorder

d. Location Data Retriever

e. Limitations

4. Analysis

a. Data overview

b. Retweets mapping

c. Origin Influence

d. Regression on cities

e. Propagation outside the U.S.

f. Sentiment effect

5. Conclusion

1. Introduction

a. Motivation

Twitter[1] is the 8th[2] most visited website on the web. It has grown from a simple status-posting service into what is widely regarded as a globalizing force that enables access to global information. Twitter is not just used by people to get closer to personalities they like. It also has many other uses, such as helping to spread news, offering a lightweight communication platform, informing on trends and feelings, connecting organizations and users, and providing a platform for citizen journalism.

Localization and geotargeting are increasingly common in Internet services. They are used to enrich the user experience and provide specific services based on a user's location. With the rise of mobile, obtaining this information in real time and using it has become increasingly achievable. Social media are distinct in that distance between individuals is often considered from a social perspective rather than a purely geographic one. Yet, since Twitter is not only a 'social' medium but also a new form of media and a tool for propagating information, it seemed worthwhile to reflect on the actual proximity of people interacting on Twitter.

To most users, Twitter appears more global than other social media sites, yet there is still little literature on this aspect of the network. The global aspect of the network has its limits, however, and my analysis helps to explain why, although big cosmopolitan cities have a global reach, localized cities have almost no global visibility on Twitter.

b. Approach

Because Twitter data is abundant in terms of content produced while the tools and methods to analyze it are scarce, the approach here was to focus on a specific interaction in the network: retweets and their distance from the original tweet location.

Measuring the reach of a city on Twitter assumes that the original location of a tweet might have an impact on its future propagation in the country and beyond. It will certainly not explain the potential of a tweet to spread far and wide, but it will permit an analysis of the difference between large and local cities and outline the limitations of Twitter as a universal global network.

The dataset created for this analysis consists of two main categories of data points. The first category consists of all the tweets sent from 12 cities in the U.S. during a week. The second category consists of all the retweets associated with the tweets recorded.

c. Challenges

i. Data Size

[1] https://twitter.com/

[2] www.alexa.com/siteinfo/twitter.com

The community of users on Twitter produces content at an extremely high rate. Analyzing Twitter data therefore means handling a significant amount of data, even when the analysis window is relatively small. It also meant that a suitably tailored infrastructure was important in order to store and manipulate the data, both in real time and for later analysis. Twitter is flexible in the amount of data it provides, but its built-in filters are not always effective, which meant that a large percentage of the data needed to be filtered out after the collection process.

In addition, Twitter reserves the right to restrict the amount of data it provides during peak times to save processing power on its infrastructure. A review of the data suggests that this may have happened during the capture, and it must be taken into account for longer capture windows. Finally, even though Twitter allows access to a great deal of data through its public API, it imposes numerous restrictions on that access.

ii. Scarcity of geo-data

Since geolocation data is sensitive and the feature is still in its early stages, it is disabled by default in Twitter's services. Some users have enabled it, either because they want to discover new features or because they were prompted by Twitter, which is currently expanding the feature. This is why geo-data is still relatively rare and represents only a subset of the data available on Twitter.

iii. Contents of Tweets

Trying to analyze the content of tweets is a genuine challenge. Not only are tweets written in multiple languages, they also include a lot of metadata and other artifacts that complicate analysis. Hashtags, replies, URLs and other items must be stripped out for better readability, and yet they might contain useful information on the content or the user's state of mind when publishing a tweet.
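As an illustration of this trade-off, the following minimal sketch strips URLs, mentions and hashtags from a tweet while keeping the hashtags aside instead of discarding them. It is not the code used in this study; the function name and the regular expressions are assumptions chosen for the example, and they do not cover every artifact found in real tweets.

```python
import re

# Illustrative patterns for common tweet artifacts (not exhaustive).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_tweet(text):
    """Return (readable_text, hashtags) for one tweet.

    Hashtags are returned separately rather than thrown away, since they
    may carry information about the topic or the user's state of mind.
    """
    hashtags = HASHTAG_RE.findall(text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split()), hashtags
```

Returning the hashtags alongside the cleaned text keeps both options open: the text can go to a sentiment classifier while the tags remain available for later analysis.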

iv. Privacy

Dealing with user-identified data always introduces challenges. Having an identified data set of mood and location raised the issue of user privacy. This analysis tried to protect the users' privacy as much as possible, not only because it is a moral obligation, but also because it is enforced by strict rules in the Terms and Conditions[3] of the Twitter API. Developers must ensure they use the data with the utmost respect for user privacy and are responsible for using the data in a secured and private environment. To comply with those rules, the present analysis is based on anonymized and aggregated data.

[3] Twitter Developers Terms and Conditions, https://dev.twitter.com/overview/terms/agreement

2. Related Work

a. Twitter

Twitter is an interesting social medium to study, as in just a couple of years it became an essential communication tool on the Web, with 750 million registered users and more than 270 million active users per month[4]. The network has been widely analyzed and studied thanks to the data it provides through its multiple APIs[5]. A 2010 paper by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon entitled What is Twitter, a Social Network or a News Media?[6] describes different aspects of the network, from the degree of separation of users to the role retweets play on the network. They emphasize that retweets are an effective way to propagate information on Twitter and that the number of followers behind the original tweet has little effect on the capacity of a tweet to be widely retweeted. Although this analysis offers a clearer view of the propagation of a tweet on the network, the actual reach in space is not tackled. The paper explains that since the location is provided by the user without any format constraints, it is very unreliable. That was true in 2010, but the rise of mobile provides a new angle on this issue. My analysis tries to use new ways of tracking user data and to push the analysis of retweet propagation beyond the network aspect.

b. Geolocation

Many web services are building features that use mobile phones for geolocation and geotargeting. Twitter offers geotargeting for many of its advertising products (figure 1) and intends to fully utilize this new set of data.

[Figure: comparison of the Promoted Tweet and Promoted Account products (targeting options, Twitter pricing, engagement, creative)]

Twitter advertising products[7] (Fig. 1)

[4] Twitter Data: What Do They Represent?, Axel Bruns, Stefan Stieglitz, http://snurb.info/files/2014/Twitter%20Data.pdf

[5] https://dev.twitter.com/rest/public

[6] What is Twitter, a Social Network or a News Media?, Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon, http://www.eecs.wsu.edu/~assefaw/CptS580-06/papers/2010-www-twitter.pdf

[7] Quantcast Twitter Guide 2014, http://info.quantcast.com/rs/quantcast/images/Quantcast%20Twitte-%20Guide-2014-final.pdf, https://business.twitter.com/solutions/geo-targeting

On the other hand, access to geolocation data on Twitter users is not prominently advertised by the Twitter API. A potentially complementary paper by Ryan Compton, David Jurgens and David Allen, entitled Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization[8], caught my attention. They explain how a user's friends and followers on Twitter can be used to infer the user's location within a 6.38 km radius. They use publicly available data, which provide a location for a small number of tweets, to infer the probable location of users. With the appropriate time and resources, this would have been a great addition to the method used in the present analysis to find the location of retweets.

c. Sentiment

Sentiment analysis is increasingly used on Twitter. It makes it possible to get user feedback on specific topics or at a certain moment with a large sample size and little experimental bias. More and more tools try to provide comprehensive and visual ways to present sentiment data in social media. Agave[9] is one example: it provides insights and attempts to visualize sentiment on timelines in order to infer changes of mood in the Twitter population.

A few examples show that sentiment analysis on Twitter holds great promise in this area. It has been used, for example, to analyze the public reaction to presidential debates and to reproduce the results of polls before presidential elections. In From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series[10], the authors present their text analysis methodology and show that Twitter is surprisingly precise when looking at presidential election preferences.

[8] Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization, Ryan Compton, David Jurgens, David Allen, http://arxiv.org/pdf/1404.7152v2.pdf

[9] Collaborative Visual Analysis of Sentiment in Twitter Events, Michael Brooks, John J. Robinson and Megan K. Torkildson, https://students.washington.edu/-jbrooks/wp-content/uploads/2014/10/Brooks-et-al.-CDVE2014-Collaborative-visual-analysis.pdf

[10] From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series, Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, Noah A. Smith, http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/viewFile/1536/1842/

3. System Architecture

The recording architecture consisted of two machines for data processing and storage. One server ran PHP scripts (capturing and processing data) and a MySQL database (storing data). A second server monitored the first one to make sure the system was running without interruption and to provide real-time optimization of the scripts and databases. This setup proved particularly effective at ensuring reliability when a script hung or when maintenance performed by one of the server providers inadvertently crashed a script.

Some of the code was optimized during the data collection to answer specific needs and to optimize resources and the quality of the data. The description below is the final architecture used to gather the data in my analysis.

a. Tweets Recorder

The first part of the system was responsible for retrieving, in real time, all the tweets emitted from the 12 U.S. cities. This module queried Twitter's "streaming" API. This API has few limitations: it only requires a filtering parameter (location or keyword) and an infrastructure fast enough to store the data in real time. It returns a JSON object with all the available information for a tweet.

Since it was unclear which data would be useful for further analysis, and the server used had reasonable power, I chose to process some of the data in real time (PHP being particularly good at handling JSON) and to store the rest for potential later use. The next two parts of the system are dedicated to the collection of retweets.
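The "parse a few fields now, keep the raw payload for later" choice can be sketched as follows. This is an illustration only, written in Python rather than the PHP used for the actual system, with SQLite standing in for MySQL. The table layout and function names are hypothetical, and the JSON field names (`id_str`, `coordinates`) are assumed to follow the tweet format returned by the API at the time.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) storage table; SQLite stands in for MySQL."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id TEXT PRIMARY KEY, lon REAL, lat REAL, raw TEXT)"
    )
    return db


def record_tweet(db, payload):
    """Store one streamed tweet: a few parsed fields plus the raw JSON."""
    tweet = json.loads(payload)
    # "coordinates" is assumed to be null or {"coordinates": [lon, lat]}.
    point = tweet.get("coordinates") or {}
    lon, lat = point.get("coordinates") or (None, None)
    db.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
        (tweet["id_str"], lon, lat, payload),
    )
    db.commit()
```

Keeping the untouched payload in the `raw` column means that fields ignored during collection can still be extracted later without querying the API again.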

b. Retweets Updater

This second part of the system was implemented to work around some limitations of the Twitter API. It is a script that loops through all the tweets already recorded and performs some predefined actions. For 100 tweets at a time, it verifies whether each tweet is still present or has been deleted, checks whether it has been retweeted, and flags it for potential retweet retrieval by the third module if needed. Since the retweet recorder module could only retrieve retweets at a slow pace (due to a Twitter API limitation explained below), knowing which tweets to handle first was the only way to capture data as close to real time as possible.
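One pass of such an updater might look like the sketch below. It is a simplified illustration, not the original script: `lookup` is a hypothetical callable standing in for the API call that returns the current retweet count of up to 100 tweets, and the bookkeeping dictionary is an assumption made for the example.

```python
def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def update_flags(tweet_ids, lookup, known_counts):
    """Return (deleted_ids, ids_to_fetch) after one pass over all tweets.

    `lookup` is a hypothetical callable that takes up to 100 ids and returns
    {id: current_retweet_count} for the tweets that still exist.
    `known_counts` maps id -> retweet count seen on the previous pass.
    """
    deleted, to_fetch = [], []
    for batch in batches(tweet_ids):
        current = lookup(batch)
        for tweet_id in batch:
            if tweet_id not in current:
                deleted.append(tweet_id)      # tweet no longer available
            elif current[tweet_id] > known_counts.get(tweet_id, 0):
                to_fetch.append(tweet_id)     # new retweets to retrieve
                known_counts[tweet_id] = current[tweet_id]
    return deleted, to_fetch
```

Comparing the current count with the last known count is what lets the third module skip tweets that have gained no retweets since the previous pass.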

c. Retweets Recorder

This third module recorded the retweets of the tweets already recorded. This was done in real time for one specific reason: the Twitter API only serves the last 100 retweets for a given tweet. The more often the system collected retweets, the smaller the chance of missing some. Since Twitter enforces quotas on the number of API calls per 15 minutes, and since this API call only fetches data for one tweet at a time, having a way to prioritize tweets was essential, hence the second module described previously.
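The prioritization can be pictured as choosing, for each 15-minute window, the tweets with the largest backlog of uncollected retweets. The sketch below is one plausible rule, not necessarily the one implemented; the quota value is not given in the text and is left as a parameter, and both function names are hypothetical.

```python
import heapq

MAX_RETWEETS_SERVED = 100  # the API only serves the last 100 retweets


def pick_for_window(pending, quota):
    """Choose which tweets to fetch in one rate-limit window.

    `pending` maps tweet id -> number of retweets not yet collected.
    `quota` is the number of calls allowed per window (one tweet per call);
    its actual value is not given in the text and is an assumption here.
    Tweets closest to overflowing the 100-retweet limit go first.
    """
    return heapq.nlargest(quota, pending, key=pending.get)


def at_risk(pending):
    """Tweets whose backlog already exceeds what one call can return."""
    return sorted(t for t, n in pending.items() if n > MAX_RETWEETS_SERVED)
```

For example, with `pending = {"a": 5, "b": 120, "c": 40, "d": 99}` and a quota of 2, tweets `b` and `d` are fetched first, and `b` has already lost some retweets because its backlog exceeds 100.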

d. Location Data Retriever

The Twitter API used to retrieve the retweets was supposed to return the geo-data of a retweet. The documentation and sample data show this information, yet in practice the API always returned null data. The last part of the system was designed to circumvent this shortcoming of the API. This module uses a proxy for the geo-data: it extracts the user ID from a retweet, fetches the most recent tweets sent by this user and scans them for location data, based on frequency and timestamp.
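The text does not spell out how frequency and timestamp are combined, so the following sketch shows only one plausible rule, stated as an assumption: take the place that appears most often in the user's recent geotagged tweets and break ties in favour of the most recently seen place. The function name and input format are hypothetical.

```python
from collections import Counter


def infer_location(recent_tweets):
    """Guess a user's location from their recent tweets.

    `recent_tweets` is a list of (timestamp, place) pairs where `place` is
    None when the tweet has no geo-data. Returns the most frequent place,
    breaking ties in favour of the most recently seen one, or None.
    """
    located = [(ts, place) for ts, place in recent_tweets if place is not None]
    if not located:
        return None
    counts = Counter(place for _, place in located)
    last_seen = {}
    for ts, place in located:
        last_seen[place] = max(ts, last_seen.get(place, ts))
    return max(counts, key=lambda p: (counts[p], last_seen[p]))
```

Users with no geotagged tweet at all yield `None`, which is why only a subset of retweets can be mapped to a location.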

e. Limitations

Given the way the system was built, the computing resources and the heavy limitations of the Twitter API, two main trade-offs are present in the dataset.

A first trade-off had to be made for tweets with more than 100 retweets. Even with real-time collection, given the strain on the system and the extreme virality of some tweets, many retweets were missed. Only a very small number of tweets fall in this category (660, about 0.05% of all tweets and roughly 0.3% of tweets with retweets), but it would be interesting to get data for those to complete the analysis. Overall, the capture rate was 63.02%. For tweets with fewer than 100 retweets, the capture rate is 84.76%. For tweets with more than 100 retweets, the capture rate is less than 1%. This means that the most viral tweets are largely absent from this analysis, and it would be interesting to have access to Twitter's archive service (GNIP) to see how the analysis would change with broader data.
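The overall capture rate is consistent with the retweet totals reported in the Analysis section (514,820 retweets captured and 816,903 retweets in total). Reading it as a simple ratio of these two figures is an interpretation rather than something stated explicitly:

```latex
\[
\text{capture rate}
  = \frac{\text{retweets captured}}{\text{retweets generated}}
  = \frac{514{,}820}{816{,}903}
  \approx 0.6302 = 63.02\%
\]
```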

The second trade-off applies to location data. Even with a full week of tweets, some small cities have only a handful of retweets with location data. This still allows preliminary conclusions to be drawn, but a longer time frame would allow deeper verification.


4. Analysis

a. Data overview

The collection of tweets ran for 7 days. This period was chosen in order to get a significant amount of data and to account for any variation over the course of a week. The collection of retweets continued over the next two months to get as many retweets as possible.

A total of 1,406,331 tweets were recorded. They originated from 12 U.S. cities mapped in figure 2: New York; Los Angeles; Chicago; San Francisco; Boston; Washington; Las Vegas; Kansas City, Missouri; Fairfield, California; Odessa, Texas; Springfield, Illinois and Charleston, South Carolina.

Those cities were chosen for their different population sizes and areas as well as their different locations in the U.S.

Mapping of Tweets Recorded (Fig. 2)

The following graphs (figures 3, 4 and 5) give more context on the distribution of the data and how each city produces content on the network.

Tweets / City (Fig. 3)

Tweets / City adjusted for Population Size (Fig. 4)

Repartition over Time by City (Fig. 5)

It appears that Twitter's activity is cyclical, with 3 AM being the least active period. The two discrepancies in the data (March 18th, 9 PM and March 24th, 5 PM) can be explained by the fact that during those peak hours, the Twitter API does not deliver all the tweets, in order to lighten the load on Twitter's infrastructure, hence the seemingly low traffic at those times.

The 1,406,331 tweets recorded generated a total of 816,903 retweets during the two months following the original capture, of which 514,820 were captured.

Retweets / City (Fig. 6)

Has at least 1 RT / City (Fig. 7)

RT / Tweet / City (Fig. 8)

There is clearly a hierarchy that is not based on city size alone. Los Angeles, New York and Washington are three cities where tweets are especially well broadcast and retweeted.

b. Retweets mapping

In order to map the retweets, two data points were needed: the location of the original tweet and the location of the retweet. Since the tweets were captured based on location, all of them contained location data, often as a precise point (longitude/latitude). The rest were tied to a specific place (a building, a park, ...) described by box coordinates. For those, the middle of the box was taken as the origin point.
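Taking the middle of the box can be sketched as below, assuming the box is available as a list of (longitude, latitude) corners. The function name and input format are hypothetical, and the midpoint of the extremes is only one reasonable reading of "the middle of the box".

```python
def box_center(corners):
    """Return the (lon, lat) midpoint of a bounding box.

    `corners` is a list of (lon, lat) pairs, e.g. the four corners of a
    place. The midpoint of the extremes is used, which ignores boxes that
    straddle the 180th meridian (not an issue for the 12 cities studied).
    """
    lons = [lon for lon, _ in corners]
    lats = [lat for _, lat in corners]
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)
```

For a place as small as a building or a park, the error introduced by using the center instead of the exact position is negligible compared with the distances analyzed here.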

As explained in the system architecture, only a subset of the retweets could be mapped to a location, using tweets sent by the same user. Out of the 514,820 retweets captured, 32,551 (about 6.3%) could be linked to a location. This shows the scarcity of geo-data on Twitter, as the feature is still being promoted to users. Yet the ratio between the number of retweets with location and the total number of retweets (table 1) is similar across cities, so the bias introduced by having only a small percentage of retweets mapped should be minimal.

(Table 1)

City             Retweets with location / retweets
Springfield      0.063
Odessa           0.064
Fairfield        0.052
San Francisco    0.085
Charleston       0.076
Kansas City      0.068
Boston           0.075
Las Vegas        0.056
Chicago          0.055
Washington       0.062
New York         0.067
Los Angeles      0.063

Before looking at the effect of the origin, figures 9 and 10 show an overview of the dispersion of the data.

Map of Retweets (Fig. 9)

Maps of the Retweets per City of origin (Fig. 10): one map each for Boston, Charleston, Chicago, Fairfield, Kansas City, Las Vegas, Los Angeles, New York, Odessa, San Francisco, Springfield and Washington.

Looking beyond the number of retweets, there are clear patterns in the propagation of tweets depending on the city of origin. Los Angeles, New York and Washington have a more international profile, with retweets reaching both U.S. coasts, Europe and Asia. Kansas City and Charleston seem to have a more regional reach, with retweets propagating within the U.S. around their point of origin. Odessa and Springfield seem to be much more local, with almost no retweets going beyond the state line.

c. Origin Influence

Looking at the earlier maps (figures 9 and 10), there seem to be distinctly different patterns of propagation. One way to understand the difference in reach between cities is to look at the average distance between the origin and the retweet location (figure 11). The averages confirm what the maps show, but reach does not simply follow the size of the city. (Chicago is as local as Fairfield, for example.)
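The distance formula is not specified in the text. A common choice for two longitude/latitude points is the great-circle (haversine) distance, used here as an assumption. The sketch below computes it and averages it per city of origin; the function names and input format are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def average_distance_by_city(pairs):
    """Average origin-to-retweet distance per city.

    `pairs` is an iterable of (city, origin, retweet_location) where both
    locations are (lon, lat) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for city, origin, dest in pairs:
        totals[city][0] += haversine_km(*origin, *dest)
        totals[city][1] += 1
    return {city: total / n for city, (total, n) in totals.items()}
```

With this formula, Boston and New York are roughly 300 km apart, which gives a sense of scale for the averages reported per city.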

City             AVG (km)
Odessa           506
Kansas City      883
Charleston       1,084
Springfield      1,163
Chicago          1,608
Fairfield        1,658
Boston           1,894
Washington       1,994
Las Vegas        2,222
New York         2,488
Los Angeles      2,570
San Francisco    2,759

Retweets Average distance from origin (Fig. 11)

In order to understand to what extent a city has a global or local reach, I looked at the percentage of retweets per city depending on the distance from origin. The following graph (figure 12) shows the percentage of retweets that are within different radii (everything above 10,000 km is in the last interval).
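The computation behind such a graph can be sketched as a cumulative share per radius. The 250 km step matches the intervals used in this analysis, while the function name and the representation of the last interval as `None` are choices made for this example.

```python
def cumulative_share(distances_km, step=250, cap=10000):
    """Cumulative percentage of retweets within each radius.

    Returns a list of (radius_km, percent) for radius = step, 2*step, ...,
    up to `cap`, followed by a final (None, 100.0) entry that gathers
    everything at or beyond `cap`, as in the last interval of the graph.
    """
    if not distances_km:
        raise ValueError("at least one distance is required")
    n = len(distances_km)
    rows = []
    for radius in range(step, cap + 1, step):
        within = sum(1 for d in distances_km if d < radius)
        rows.append((radius, 100.0 * within / n))
    rows.append((None, 100.0))
    return rows
```

Applied to the retweet distances of a single city, this yields one curve of the graph; repeating it for each city gives the full comparison.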

20

0 Waston

.

Fairfield

M

Kansas City

. Las Vegas

M

Los Angeles

0 New York

SOdeso

SSan

Francis=

SSpringfteld

SWashington

Percentage of Retweets per kilometers radius (Fig. 12)

The first thing to notice is that, for every city, 35% or more of the retweets are within 250 km of their origin and 45% or more are within 500 km, suggesting that despite being advertised as a global media tool, Twitter is substantially used within a local area.

(Table 2) Cumulative percentage of retweets within a given radius of the origin, one column per city, for radii from < 250 km to < 10,250 km in steps of 250 km.

< 250 km:  49 55 45 56 56 35 49 39 59 45 44 43

To see which cities are the most local or the most global in terms of reach, it is interesting to look at the radius from the origin required to capture a given percentage of the retweets. Table 2 presents those percentages, and the colors indicate the thresholds crossed (70%, 80%, 90%). Cities with little white are the most local. Cities with little dark blue are the most global. Finally, the more light blue and medium-dark blue, the more regional.

A city like Odessa proves to be extremely local in the reach of its tweets, whereas tweets from cities like San Francisco or Los Angeles propagate the furthest. A partial explanation of this difference compared to New York could be that those two cities are further away from Europe. It appears that, despite Twitter being considered a global medium, localized cities have trouble reaching the international stage on Twitter.

Tweets further than 10,000 km are all in the last bucket
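The radius required to capture a given percentage of the retweets can be found by widening the radius in 250 km steps until the share is reached. This is a minimal sketch of that idea, with a hypothetical function name, not the procedure actually used to build table 2.

```python
def radius_for_share(distances_km, share, step=250):
    """Smallest multiple of `step` (km) that contains `share` of retweets.

    `share` is a fraction between 0 and 1, such as 0.7, 0.8 or 0.9.
    A retweet counts as inside a radius when its distance is strictly
    smaller, consistent with "< 250 km" style intervals.
    """
    if not distances_km:
        raise ValueError("at least one distance is required")
    n = len(distances_km)
    radius = step
    while sum(1 for d in distances_km if d < radius) / n < share:
        radius += step
    return radius
```

Calling it with 0.7, 0.8 and 0.9 for one city gives the three thresholds of that city: the smaller the radii, the more local the city.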


d. Regression on cities

Table 3 shows the results of the regression of the distance in km (dependent variable) between tweets and retweets against each city (independent variables). Most of the results are significant and show the clear difference in propagation (Boston is the reference in the following table).

(Table 3)

City              Coef.      Std Err.   t        P>|t|    [95% Conf. Interval]
Charleston        -810.77    204.54     -3.96    0.000    -1211.669    -409.858
Chicago           -286.67    100.14     -2.86    0.004    -482.9751    -90.40272
Fairfield         -236.80    268.62     -0.88    0.378    -763.3101    289.7138
Kansas City       -1011.46   159.16     -6.35    0.000    -1323.427    -699.5007
Las Vegas         327.31     113.32     2.89     0.004    105.2067     549.4217
Los Angeles       675.58     87.29      7.74     0.000    504.4957     846.6595
New York          593.76     88.81      6.69     0.000    419.6971     767.8154
Odessa            -1388.06   383.72     -3.62    0.000    -2140.168    -635.9551
San Francisco     865.07     222.82     3.88     0.000    428.3391     1301.793
Springfield       -730.93    438.07     -1.67    0.095    -1589.552    127.7014
Washington        99.22      96.77      1.03     0.305    -90.4606     288.9043
Cons. (Boston)    1894.36    80.99      23.39    0.000    1735.607     2053.106

It appears that Las Vegas, Los Angeles, New York and San Francisco have a more global reach than Boston (Washington also has a positive coefficient, but its P-value is too high to draw a strong conclusion). Conversely, Charleston, Chicago, Kansas City and Odessa are much more local than Boston (as are Springfield and Fairfield, though their P-values are too high to draw a strong conclusion). This analysis emphasizes the large differences between cities seen earlier.
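When the only independent variables are city indicators, an ordinary least squares regression reduces to comparing group means: the constant is the mean distance for the reference city, and each coefficient is the difference between a city's mean and that reference. The sketch below illustrates this property with a hypothetical function; it does not reproduce the standard errors, and it is not the statistical software used for table 3.

```python
from collections import defaultdict


def dummy_regression(observations, reference):
    """Coefficients of distance regressed on city indicator variables.

    `observations` is an iterable of (city, distance_km). With one
    indicator per city and `reference` omitted, ordinary least squares
    reduces to group means: the constant is the reference city's mean and
    each coefficient is that city's mean minus the reference mean.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for city, distance in observations:
        sums[city] += distance
        counts[city] += 1
    means = {city: sums[city] / counts[city] for city in sums}
    constant = means[reference]
    coefs = {c: m - constant for c, m in means.items() if c != reference}
    return constant, coefs
```

This is consistent with the figures reported here: the constant of 1894.36 matches Boston's average distance of 1,894 km, and adding Odessa's coefficient (1894.36 - 1388.06) gives about 506 km, Odessa's average distance in figure 11.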

In order to understand why that is the case, it seemed worthwhile to use some other independent variables that might explain those differences. In the following table (table 4), the distance is still used as the dependent variable.

(Table 4)

Variables                  Coef.      Std Err.    t        P>|t|    [95% Conf. Interval]
Population 2013 (k)        .062       .0067       9.32     0.000    .0488599    .0748935
Has a large airport[11]    1117.10    102.94      10.85    0.000    915.3287    1318.866
Per Capita Income          -.         .0026896    -4.50    0.000    -.          -.
Cons.                      1559.11    154.46      10.09    0.000    1256.365    1861.864

We can see in table 4 that population has a small but measurable effect, suggesting that big cities must offer other traits that increase reach. Having a large airport is apparently a strong contributor to reach. In this case, it is difficult to know whether a city that already has reach goes on to build a large airport or whether a large airport contributes to the reach. There is probably some reverse causality here. The last variable is interesting, as per capita income appears to play only a small role in reach, which would suggest that the economic situation of a city plays almost no role in the propagation of tweets.

e. Propagation outside the U.S.

Defining local can be done using pure distance but also at the country scale. The following graph (figure 13) shows the percentage of retweets that are international (outside the U.S.).

[Bar chart, one bar per city (Odessa, Kansas City, Charleston, Fairfield, Springfield, Chicago, Las Vegas, Boston, San Francisco, Los Angeles, Washington, New York); axis: % of RTs in the US]

Percentage of Retweets in/out the U.S. (Fig. 13)

These results are not significantly different from the distance-based ones, as retweets far away are often outside the U.S., but Washington seems to be more international on Twitter than San Francisco. This is mainly explained by the fact that tweets from Washington impact both the regional zone and the rest of the world, given that many U.S. government bodies probably tweet from this location.

[11] Defined by the FAA depending on the role of the airport, http://www.wikiwand.com/en/List_of_airports_in_the_United_States#/Primary_airports

f. Sentiment effect

In this part, the idea was to investigate how the propagation of tweets depends on their content and whether there is a difference between cities. The process involves analyzing the content of every tweet and trying to determine whether it is positive or negative. Of course, this analysis contains many errors, as semantic analysis is complicated. It was done using R and a package called sentiment[12], which rates strings according to their content and associates a sentiment (positive, negative, neutral) with each.

The following graph (figure 14) describes the difference in the type of tweets retweeted at different radii, depending on the city. For each city, each bar represents the percentage of retweets linked to a type of content for a given radius (10, 50, 100, 1000 km and all).
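The breakdown behind such a graph can be sketched as follows, assuming each retweet already carries a distance and the sentiment label of its original tweet. Reading each radius as "all retweets within that distance" is an assumption, as are the function name and input format; the classification itself is done upstream.

```python
from collections import Counter

RADII_KM = (10, 50, 100, 1000, None)  # None stands for "all distances"


def sentiment_share_by_radius(retweets, radii=RADII_KM):
    """Share of each sentiment among retweets within each radius.

    `retweets` is a list of (distance_km, sentiment) pairs, where sentiment
    is "positive", "negative" or "neutral" (labels produced upstream by the
    classifier). Returns {radius: {sentiment: percent}}.
    """
    result = {}
    for radius in radii:
        labels = [s for d, s in retweets if radius is None or d <= radius]
        counts = Counter(labels)
        total = len(labels)
        result[radius] = (
            {s: 100.0 * c / total for s, c in counts.items()} if total else {}
        )
    return result
```

Running this once per city and comparing each radius with the "all" entry shows whether local retweets are more negative or more positive than the city's baseline.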

Repartition of Sentiment in Retweets based on radius (Fig. 14)

The graph shows that Twitter is not used in the same way at the local and global scales. In San Francisco, local retweets seem to be much more negative than the baseline, whereas in Charleston this effect is reversed. For the other cities it is less clear, but it appears that more local chatter is associated with more negative tweets. This could suggest a specific usage of Twitter for local issues and that positive tweets have an overall greater reach. It also shows that in certain cities there is a specific usage of Twitter, with a probable impact on local communities. This is corroborated by the regression (table 5) of the distance (dependent variable) on the sentiment of the original tweet (independent variable).

[12] R package sentiment by Timothy Jurka

Variables            Coef.      Std Err.   t       P>|t|   [95% Conf. Interval]

Tweet is positive    447.29     40.48      11.05   0.000   367.9524   526.6254

Cons.                1959.56    33.12      59.17   0.000   1894.647   2024.471

(Table 5)
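A regression on a single 0/1 variable has a simple reading: the constant is the mean of the dependent variable for the group coded 0, and the coefficient is the difference between the two group means. The sketch below illustrates this with ordinary least squares written out by hand. The distances are made up for illustration and are not thesis data, and the function name is hypothetical.

```python
from statistics import mean


def ols_on_indicator(distances, is_positive):
    """Least-squares fit of distance = const + coef * is_positive.

    is_positive holds 0/1 values. Returns (const, coef).
    """
    x_bar = mean(is_positive)
    y_bar = mean(distances)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(is_positive, distances))
    sxx = sum((x - x_bar) ** 2 for x in is_positive)
    coef = sxy / sxx
    const = y_bar - coef * x_bar
    return const, coef


# Made-up retweet distances and sentiment flags, for illustration only.
example_distances = [120.0, 2500.0, 1800.0, 300.0, 4000.0, 950.0]
example_flags = [0, 1, 0, 0, 1, 1]
example_const, example_coef = ols_on_indicator(example_distances, example_flags)
```

Read this way, table 5 gives an average retweet distance of about 1959.56 for tweets that are not positive and about 1959.56 + 447.29 = 2406.85 for positive tweets, in whatever unit the distance was recorded (presumably km, as for the radii).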


5. Conclusion

This paper investigated the local aspect of Twitter and the difference in reach between U.S. cities on the network. It identified several differences between small and big cities in the way tweets propagate and achieve local, regional and international reach. It also emphasized the difficulty of extracting geolocation data and deriving insights from it without the proper tools and access. While many features and products are advertised around geotargeting, in reality few people actively enable those features and benefit from them. The literature on the way Twitter supposedly allows for global communication is inconclusive, and most of the data is proprietary.

This study sought to answer two questions: Do all cities have a global reach on Twitter? And what could explain the difference in reach?

The main findings are described around three axes.

What data is accessible and how to gather it? Twitter does provide multiple access points, but even with the right infrastructure the data is not truly public. Geolocation data is also still scarce compared to the amount of data on the network.

How to properly visualize the data and verify the assumption that cities have an impact on the reach of tweets? When dealing with geolocalized data, maps are a good way to present preliminary assumptions and explore potential patterns. The next step was to synthesize those assumptions and show that the 12 U.S. cities have very different reach and that only very large cities with a coastal presence have a truly global reach.

How can the content of the tweet be linked to the way different cities use Twitter? Twitter is also a particularly local medium, in the sense that its usage changes from city to city and that it is used for specific reasons at a local scale, as shown by the sentiment analysis.

There are evident limitations to this analysis, and deeper data access would allow further analysis of the most viral aspects of the network. Expanding the capture area would also allow an exploration at the continental scale and show whether the conclusions of the analysis apply outside of the U.S.

However, having a clear view of the potential reach of a given city should enable better consideration of how local information travels across the world. Moreover, geotargeted services are still in early development. Understanding the local aspect of networks such as Twitter and the way people interact with the platform could allow new forms of services and products to emerge alongside the traditional usage of Twitter in global media, for example.

Despite bringing a new way of communicating globally and enabling communication beyond the local area, Twitter is used mainly as a truly local communication tool. Moreover, only big cosmopolitan cities seem to have a truly global reach, whereas small cities struggle to reach beyond their local area of influence.



