On the Study of Diurnal Urban Routines on Twitter

Mor Naaman, SC&I, Rutgers University, mor@rutgers.edu
Amy X. Zhang, Computer Laboratory, University of Cambridge, amy.xian.zhang@gmail.com
Samuel Brody, Google∗, sdbrody@gmail.com
Gilad Lotan, Social Flow, Inc., gilad@socialflow.com

Abstract
Social media activity in different geographic regions can expose a varied set of temporal patterns. We study and characterize diurnal patterns in social media data for different urban areas, with the goal of providing context and framing for reasoning about such patterns at different scales. Using one of the largest datasets to date of Twitter content associated with different locations, we examine within-day variability and across-day variability of diurnal keyword patterns for different locations. We show that only a few cities currently provide the magnitude of content needed to support such across-day variability analysis for more than a few keywords. Nevertheless, within-day diurnal variability can help in comparing activities and finding similarities between cities.

Introduction
Social media activity in different geographic regions exposes a varied set of temporal patterns. In particular, Social Awareness Streams (SAS) (Naaman, Boase, and Lai 2010), available from social media services such as Facebook, Twitter, FourSquare, Flickr, and others, allow users to post streams of lightweight content artifacts, from short status messages to links, pictures, and videos, in a highly connected social environment. The vast amounts of SAS data reflect, in new ways, people's attitudes, attention, and interests, offering unique opportunities to understand and draw insights about social trends and habits. In this paper, we focus on characterizing social media patterns in different urban areas (US cities), with the goal of providing a framework for reasoning about activities and diurnal patterns in different cities.
Using Twitter as a typical SAS, previous research studied specific temporal patterns that are similar across geographies, in particular with respect to expression of mood (Golder and Macy 2011; Dodds et al. 2011). We aim to provide insights for reasoning about diurnal patterns in different geographic (urban) areas that can be used in studying activity patterns in these areas, going beyond previous work that mostly examined topical differences between posts in different geographic areas (Eisenstein et al. 2010; Hecht et al. 2011) or briefly examined broad diurnal differences in volume between cities (Cheng et al. 2011). Such study can contribute to urban studies, with implications for diverse social challenges such as public health, emergency response, community safety, transportation, and resource planning, as well as Internet advertising, providing insights and information that cannot readily be extracted from other sources.

Developing such a framework presents a number of challenges, both technical and practical. First, SAS data (and in particular Twitter) has been shown to be quite noisy. Users of SAS post different types of content, from information and link sharing, to personal updates, to social interactions, and many others (Naaman, Boase, and Lai 2010). Can stable patterns be reliably extracted given this noisy environment? Second, reliably extracting the location associated with Twitter content is still an open problem, as we discuss below. Finally, Twitter content volume shifts over time as more users join the service, and fluctuates widely in response to breaking events and other happenings, from Valentine's Day to the news about Bin Laden's capture and demise. Such temporal volume fluctuations might distort otherwise stable patterns and make them difficult to extract.

∗ Amy and Sam were at Rutgers at the time of this work.
Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In this paper, therefore, we report on a study that extracts and reasons about stable temporal patterns from Twitter data. In particular, we: 1) use large-scale data with manual coding to get a wide sample of tweets for different cities; 2) study within-day and across-day variability of patterns in cities; and 3) reason about differences between cities with respect to overall patterns as well as individual ones.

Related Work
Broadly speaking, this work is informed by two key areas of related work: the use of new technologies and data sources for urban studies, and studies of social media to extract "real world" insights, or temporal dynamics. Here we broadly address these areas, before discussing other recent research that directly informed our work. The related research area sometimes dubbed "urban sensing" (Cuff, Hansen, and Kang 2008) analyzes various new datasets to understand the dynamics and patterns of urban activity. Most prominently, mobile phone data, mainly proprietary data from wireless carriers (e.g., calls made and positioning data), helps expose travel patterns and broad spatio-temporal dynamics, e.g., in (Gonzalez, Hidalgo, and Barabasi 2008). Social media has also been used to augment our understanding of urban spaces. Researchers have used geotagged photographs, for example, to gain insight about tourist activities in cities (Ahern et al. 2007; Crandall et al. 2009; Girardin et al. 2008). Twitter data can augment and improve on these research efforts, and allow for new insights about communities and urban environments. More recently, as mentioned above, researchers have examined differences in social media content between geographies using keyword and topic models only (Eisenstein et al. 2010; Hecht et al. 2011). Cheng et al. (2011) examine patterns of "check-ins" on Foursquare and briefly report on differences between cities regarding their global diurnal patterns.
Beyond geographies and urban spaces, several research efforts have examined social media temporal patterns and dynamics. Researchers have examined daily and weekly temporal patterns on Facebook (Golder, Wilkinson, and Huberman 2007), and to some degree Twitter (Java et al. 2007), but did not address the stability of patterns, or the differences between geographic regions. Recently, Golder and Macy (2011) have examined temporal variation related to Twitter posts reflecting mood, across different locations, and showed that diurnal (as well as seasonal) mood patterns are robust and consistent in many cultures. The activity and volume measures we use here are similar to Golder and Macy's, but we study patterns more broadly (in terms of keywords) and in a more focused geography (city scale instead of timezone and country). Identifying and reasoning about repeating patterns in time series has, of course, long been a topic of study in many domains. Most closely related to our domain, Shimshoni, Efron, and Matias (2009) examined the predictability of search query patterns using day-level data. In their work, the authors model seasonal and overall trend components to predict search query traffic for different keywords. The "predictability" criterion, though, is arbitrary, and only used to compare between different categories of use.

Definitions
We begin with a simple definition of the Twitter content as used in this work. Users are marked u ∈ U, where u can be minimally modeled by a user ID. However, the Twitter system features additional information about the user, most notably their hometown location ℓu and a short "profile" description. Content items (i.e., tweets) are marked m ∈ M, where m can be minimally modeled by a tuple (um, cm, tm, ℓm) containing the identity of the user posting the message um ∈ U, the content of the message cm, the posting time tm, and, increasingly, the location of the user at the time of posting ℓm.
This simple model captures the essence of the activity in many different SAS platforms, although we focus on Twitter in this work. Using these building blocks, we now formalize some aggregate concepts that will be used in this paper. In particular, we are interested in content for a given geographic area, and in examining that content based on diurnal (hourly) patterns. We use the following formulations:
• MG,w(d, h) is the set of content items associated with a word w posted in geographic area G during hour h of day d.
• XG,w(d, h) = |MG,w(d, h)| defines a time series of the volume of messages associated with keyword w in location G. In other words, XG,w(d, h) is the volume of messages (tweets) in region G that include the word w and were posted during hour h = 0 ... 23 of day d = 1 ... N, where N is the number of days in our dataset.
We describe below how these variables are computed.

Data Collection
In this section, we describe the data collection and aggregation methodology. We first show how we collected a set of tweets MG for every location G in our dataset. We then describe how we selected a set of keywords w for the analysis, and computed the XG,w(d, h) time series for each keyword and location. The data was extracted from Twitter using the Twitter Firehose access level, which, according to Twitter, "Returns all public statuses." Our initial dataset included all public tweets posted from May 2010 through May 2011.

Tweets from Geographic Regions
To reason about patterns in various geographic regions, we need a robust dataset of tweets from each region G to create the sets MG,w. Twitter offers two possible sources of location data. First, a subset of tweets have associated geographic coordinates (i.e., are geocoded, containing an ℓm field as described above). Second, tweets may have a location available from the profile description of the Twitter user that posted the tweet (ℓu, as described above).
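As an illustration, the XG,w(d, h) series formalized above amounts to a simple counting pass over the tweets. The sketch below assumes tweets have already been reduced to hypothetical (text, day, hour) tuples, a simplified stand-in for the (um, cm, tm, ℓm) tuples in the text:

```python
from collections import defaultdict

def hourly_keyword_volume(tweets, keyword):
    """Count tweets containing `keyword`, bucketed by (day, hour).

    `tweets` is an iterable of (text, day, hour) tuples -- an
    illustrative simplification, not the paper's actual data format.
    Returns a dict mapping (day, hour) -> volume, i.e. X_{G,w}(d, h).
    """
    X = defaultdict(int)
    for text, day, hour in tweets:
        # Simple whitespace tokenization; the paper's real keyword
        # extraction (stopwords, punctuation rules) is described later.
        if keyword in text.lower().split():
            X[(day, hour)] += 1
    return dict(X)

tweets = [
    ("grabbing lunch downtown", 1, 12),
    ("lunch with the team", 1, 12),
    ("late dinner tonight", 1, 20),
    ("lunch again", 2, 13),
]
print(hourly_keyword_volume(tweets, "lunch"))
# {(1, 12): 2, (2, 13): 1}
```

Missing (d, h) keys simply mean zero volume for that hour, which matches treating XG,w as a sparse counting series.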
The ℓm location often represents the location of the user when posting the message (e.g., attached to tweets posted from a GPS-enabled phone), but was not used in our study since only a small and biased portion of the Firehose dataset (about 0.6%) includes ℓm geographic coordinates. The rest of the paper uses location information derived from the user profile, as described next. Note that the profile location is not likely to be updated as users move around space: a "Honolulu" user will appear to be tweeting from their hometown even when they are (perhaps temporarily) in Boston. Overall, though, the profile location will reflect tendencies, albeit with a moderate amount of noise.

We used the location data associated with the profile of the user posting the tweet, ℓu, to create our datasets for each location G. A significant challenge in using this data is that the location field on Twitter is a free-text field, resulting in data that may not even describe a geographic location, or describe one in an obscure, ambiguous, or unspecific manner (Hecht et al. 2011). However, according to Hecht et al., 1) about 19% of users had an automatically-updated profile location (updated by mobile Twitter apps such as Ubertweet for Blackberry), and 2) 66% of the remaining users had at least some valid location information in their profile; about 70% of those had information that exceeded the city-level data required for this study. In total, the study suggests that 57% of users would have some profile location information appropriate for this study. As Hecht et al. (2011) report, this user-provided data field may still be hard to automatically and robustly associate with a real-world location. We overcome or avoid some of these issues using the method specified below.¹

To create our dataset, we had to resolve the free-text user profile location ℓu associated with a tweet to one of the cities (geographic areas G) in our study. Our solution had the following desired outcomes, in order of importance:
• High precision: as little content as possible from outside that location.
• High recall: as much content as possible, without compromising precision.
In order to match the user profile location to a specific city in our dataset, we used a dictionary-based approach. In this approach:
1. We used a different Twitter dataset to generate a comprehensive dictionary of profile location strings that match each location in our dataset. For example, strings such as "New York, NY" or "NYC" can be included in the New York City dictionary.
2. We matched the user profile associated with each tweet in our dataset against the dictionary. If a profile location ℓu matches one of the dictionary strings for a city, the tweet is associated with that city in our data.
We show next how we created a robust and extensive location dictionary for each city.

Generating Location Dictionaries
The goal of the dictionary generation process was to establish a list of strings for each city that would reliably (i.e., accurately and comprehensively) represent that location, resulting in the most content possible, with as little noise as possible. To this end, we obtained an initial dataset of tweets that are likely to have been generated in one of our locations of interest. We used a previously collected dataset of Twitter messages for the cities in our study. This dataset included, for every city, represented via a point-radius geographic region G, tweets associated with that region. The data was collected from September 2009 until February 2011 using a point-radius query to the Twitter Search API. For such queries, the Twitter Search API returns a set of geotagged tweets, as well as tweets geocoded by Twitter to that area (using a proprietary combination of user profile, IP, and other signals), that match the point-radius query. For every area G representing a city, we thus obtained a set LG of tweets posted in G, or geocoded by Twitter to G. Notice that other sources of approximate location data (e.g., a strict dataset of geocoded Twitter content from a location) can alternatively be used in the process described below.

From the dataset LG, we extracted the profile location ℓu for the user posting each tweet. For each city, we sorted the various distinct ℓu strings by decreasing popularity (the number of unique tweets). For example, the top ℓu strings for the point-radius area representing New York City included the strings New York (14,482,339 tweets), New York, NY (6,955,481), NYC (6,681,874), and so forth, including also subregions of New York City such as Harlem (587,430).

¹ In an alternative approach, a user's home location can be estimated to some degree from the topics they post about on their account (Cheng, Caverlee, and Lee 2010; Eisenstein et al. 2010; Hecht et al. 2011). For example, Cheng et al. (2010) claim 51% within-100-miles accuracy for users in their study.

We proceeded to clean the top location string lists for each city of noisy, ambiguous, and erroneous entries. Because the distribution of ℓu values for each location was heavy-tailed, we chose to only look at the top location strings that comprised 90% of unique tweets from each city. Figure 1 shows the cumulative distribution function (CDF) of the location string frequencies for four locations in our dataset. The x-axis represents the location strings, ordered by popularity; the y-axis is the cumulative portion of tweets the top strings account for.

[Figure 1: CDFs of location field values for the most popular 2000 location strings for four cities (Honolulu, HI; San Francisco, CA; Richmond, VA; New York, NY).]
For example, the 500 most popular location terms in New York account for about 82% of all tweets collected for New York in the LG dataset. Then, for each city, we manually filtered the lists, marking entries that are ambiguous (e.g., "Uptown" or "Chinatown," which Twitter coded as New York City) or incorrect due to geocoder issues (e.g., "Pearl of the Orient," which the Twitter geocoder associates with Boston for one reason or another); see Hecht and Chi's (2011) discussion for more on this topic. This annotation process, performed by one of the authors, involved lookups to external sources where needed to determine which terms were slang phrases or nicknames for a city, as opposed to errors. For this study, we selected a set of US state capitals, adding a number of large US metropolitan areas that are not state capitals (such as New York and Los Angeles), as well as London, England. This selection allowed us to study cities of various scales and patterns of Twitter activity. As mentioned above, using these lists of ℓu location strings for the cities, we filtered the Twitter Firehose dataset to extract tweets from users whose location fields exactly matched a string on one of our cities' lists. Using this process, we generated datasets of tweets from each city, summarized in Figure 2. The figure shows, for each city, the number of tweets in our 380-day dataset (in millions). For example, we have over 150 million tweets from Los Angeles, an average of about 400,000 a day. Notice the sharp drop in the number of tweets from the large metropolitan areas to smaller local centers (Lansing, Michigan has 2.5 million tweets, an average of 6,500 per day). We only kept locations with over 2.5 million tweets, i.e., the 29 locations shown in Figure 2.
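The dictionary-generation and matching steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function and variable names are hypothetical, the 90% cutoff mirrors the popularity threshold described above, and the manual filtering of ambiguous entries is a separate human step not shown here.

```python
def build_location_dictionary(profile_counts, coverage=0.90):
    """Keep the most popular profile-location strings that together
    account for `coverage` of tweets in L_G (the 90% cutoff above).

    `profile_counts` maps a raw profile string to its tweet count.
    """
    total = sum(profile_counts.values())
    kept, running = [], 0
    for s, count in sorted(profile_counts.items(), key=lambda kv: -kv[1]):
        if running / total >= coverage:
            break
        kept.append(s)
        running += count
    return set(kept)

def match_city(profile_location, city_dictionaries):
    """Exact-match a tweet's profile location against each city's list."""
    for city, strings in city_dictionaries.items():
        if profile_location in strings:
            return city
    return None  # tweet is not associated with any city

# Toy counts; real L_G counts are in the millions (see Figure 2).
counts = {"New York": 900, "NYC": 500, "Harlem": 80, "noise123": 5}
nyc = build_location_dictionary(counts)
print(match_city("NYC", {"New York, NY": nyc}))
# New York, NY
```

Exact string matching is what keeps precision high: a profile location either appears verbatim on a city's cleaned list or the tweet is discarded.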
[Figure 2 plots the number of tweets (in millions, 0-300) for each of the 29 locations, in increasing order: Columbia, SC; Lansing, MI; Charleston, WV; Honolulu, HI; Little Rock, AR; Salt Lake City, UT; Raleigh, NC; Baton Rouge, LA; Frankfort, KY; Oklahoma City, OK; Tallahassee, FL; Sacramento, CA; Nashville, TN; Saint Paul, MN; Richmond, VA; Denver, CO; Indianapolis, IN; Austin, TX; Columbus, OH; Phoenix, AZ; Annapolis, MD; Olympia, WA; Washington, DC; Boston, MA; San Francisco, CA; Atlanta, GA; Los Angeles, CA; London, England; New York, NY.]

Figure 2: Location and number of associated tweets in our dataset, based on user location field matching.

Keywords and Hourly Volume Time Series
We limited our analysis in this work to the 1,000 most frequent words appearing in our dataset (i.e., in the union of the sets MG for all locations G). We then calculated the frequency of all words in the dataset, removing posts whose user profile language field was not English. We also removed stopwords, using the NLTK toolkit. All punctuation and symbols, except for "@" and "#" at the beginning of a term and underscores anywhere in a term, were removed, as were all URLs. We removed terms that were not in English, keeping English slang and vernacular intact. In each geographic region, for each of the 1,000 terms, we created the time series XG,w(d, h) as described above, capturing the raw hourly volume of tweets during the days available in our dataset. We converted the UTC timestamps of tweets to local time for each city, accounting for daylight saving time. For each city, we also computed the time series representing the total hourly volume for the geographic region, denoted XG(d, h). Data collection issues resulted in a few missing days in our data collection period. For simplicity of presentation, we only use weekdays in our analysis below, as weekend days are expected to show highly divergent diurnal patterns. We had a total of 272 weekdays in our dataset for each city.
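A rough sketch of the term cleaning and timezone handling described above. The paper used NLTK's English stopword list; here a tiny stand-in set is used so the example is self-contained, and Python's standard zoneinfo module stands in for whatever timezone handling the authors actually used:

```python
import re
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Tiny stand-in; the paper used NLTK's much larger English stopword list.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in"}

def clean_terms(text):
    """Lowercase, drop URLs, strip punctuation except a leading @/# and
    underscores, and remove stopwords -- approximating the steps above."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    terms = []
    for tok in text.split():
        tok = re.sub(r"[^\w@#]", "", tok)    # strip symbols (keeps _)
        tok = re.sub(r"(?<!^)[@#]", "", tok)  # keep @/# only at term start
        if tok and tok not in STOPWORDS:
            terms.append(tok)
    return terms

def local_hour(utc_ts, tz="America/New_York"):
    """Convert a UTC timestamp to the city's local hour (DST-aware)."""
    return utc_ts.astimezone(ZoneInfo(tz)).hour

print(clean_terms("Lunch at the park! http://t.co/x #nyc"))
# ['lunch', 'at', 'park', '#nyc']
print(local_hour(datetime(2011, 5, 1, 16, 0, tzinfo=timezone.utc)))
# 12  (New York is UTC-4 in May, under daylight saving time)
```

The DST-aware conversion matters: a naive fixed UTC offset would shift every diurnal pattern by an hour for roughly half the year.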
Diurnal Patterns
The goal of this analysis is to examine and reason about the diurnal patterns of different keywords, for each geographic region. We first examine the overall volume patterns in each location, partially to validate the correctness of the collection method for our location dataset.

Natural Periodicity in Cities
As expected, the overall volume patterns of Twitter activity for the cities in our dataset show common and strong diurnal shifts. Figure 3 shows sparklines for the average hourly proportion of volume, for all cities in our data, averaged over all weekdays in our dataset. In other words, for h = 0, ..., 23 (midnight to 11pm), the figure shows the average over all days of XG(d, h) / Σ_{i=0..23} XG(d, i), where XG(d, h) is the total volume of tweets in hour h of day d. The minimum and maximum points of each sparkline are marked.

[Figure 3: Sparklines for volume by hour for the cities in our dataset (midnight to 11pm).]

The figure shows the natural rhythm of Twitter activity, which is mostly similar across the locations in our dataset, but not without differences between locations. As would be expected, in each location, Twitter experiences greater volume during the daytime and less late at night and in the early morning. The lowest level of activity, in most cities, is at 4-5am. Similarly, most cities show an afternoon peak and a late-night peak, where the highest level of activity varies between cities, with most peaking at 10-11pm. However, the figure also demonstrates differences between the cities. The consistency of patterns in the different cities is evidence that our data extraction is fairly solid, with most tweets included produced in the same timezone (at the very least) as the city in question.
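The per-hour proportions behind the Figure 3 sparklines can be computed directly from the hourly totals; a minimal sketch with toy data (the function name and input layout are illustrative):

```python
def mean_hourly_proportions(X):
    """X[d][h]: total tweet volume in hour h of day d (lists of 24).
    Returns, for each hour, the proportion of that day's volume,
    averaged over all days -- the quantity plotted in Figure 3."""
    days = len(X)
    props = [0.0] * 24
    for day in X:
        total = sum(day)  # denominator: the day's 24-hour volume
        for h, v in enumerate(day):
            props[h] += (v / total) / days
    return props

# Two toy days with a pronounced late-night peak at hour 23.
X = [[1] * 23 + [25], [1] * 23 + [25]]
p = mean_hourly_proportions(X)
print(max(range(24), key=lambda h: p[h]))  # 23
print(round(sum(p), 6))                    # 1.0
```

Normalizing each day before averaging keeps high-volume days from dominating the sparkline shape.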
Another takeaway from Figure 3 is that Twitter volume alone is probably not an immediate indication of the urban "rhythm" of a city: in other words, it is unlikely that the rise in late-night activity in cities like Lansing, Michigan is due to Lansing's many nightclubs and bars (there aren't many), but rather to heightened virtual activity that may not reflect the city's urban trends. We turn, therefore, to examining specific activities in cities based on analysis of specific keywords.

Keyword Periodicity in Cities
We aim to identify, for each city, keywords that show robust and consistent daily variation. These keywords and their daily temporal patterns represent the pulse of the city, and can help reason about the differences between cities using Twitter data. These criteria should apply to the identified keywords in each location:
• The keyword shows daily differences of significant relative magnitude (peak hours show significantly more activity).
• The keyword's daily patterns are consistent (low day-to-day variation).
• The keyword patterns cannot be explained by Twitter's overall volume patterns for the location.
Indeed, we were only interested in the keyword and location pairs that display periodicity in excess of the natural rhythm that arises from people tweeting less during sleeping hours. This daily fluctuation in Twitter volume means that daily periodicity is likely to exist in the series XG,w(d, h) for most keywords w and cities G. To remove this underlying periodicity and normalize the volumes, we look at the following transformation of each (G, w) time series:

gG,w(d, h) = XG,w(d, h) / XG(d, h)    (1)

The g transformation captures the keyword w hourly volume's portion of the total hourly volume during the same hour (d, h). Here, we also account for the drift in Twitter volume over the time of the data collection. Similar normalization was also used in (Golder and Macy 2011; Dodds et al. 2011). From here on, we use Equation 1 in our analysis.
In other words, we will refer to the time series X̂G,w(d, h), which represents the g transformation applied to the original time series of volume values, XG,w(d, h). Based on the X̂ time series, we further define the diurnal pattern series X24G,w(h) as the normalized volume, in each city, for every keyword, for every one-hour span (e.g., between 5pm and 6pm), across all weekdays in the entire span of data. In other words, for a given G, w:

X24G,w(h) = (1/n) Σ_{d=1..n} X̂G,w(d, h)  for h = 0 ... 23    (2)

In the following, we use X24 as a model in an information-theoretic sense: the series represents expected values for the variables for each keyword and location. We then use X24 and the complete X̂ series to reason about the information embedded within days, and across days, in different locations and for different keywords. These measures give us the significance and stability (or variability) of the patterns, respectively.

Variability Within Days
To capture variation within days, we used the information-theoretic measure of entropy. Using the X24(h) values, we calculate −Σ_{h=0..23} p(h) log(p(h)), where for each h, p(h) = X24G,w(h) / Σ_{i=0..23} X24G,w(i) (the proportion of the mean relative volume for that hour to the total mean relative volume). Entropy captures the notion of how much information or structure is contained in a distribution. Flat distributions, which are close to uniform, contain little information and have high entropy. Peaked distributions, on the other hand, have a specific structure and are non-entropic. We treat X24 as a distribution over relative volume for each hour to get entropy scores for each keyword. We have experimented with alternative measures for within-day variance.
For example, Fourier transforms, which can decompose a signal from its time domain to its frequency domain, were not an effective measure for this study: almost all keywords show significant power around the 24-hour Fourier coefficient, even when using the X̂ series that controls for overall Twitter volume.

Variability Across Days
To capture variability from the model across all the days in our data, we used the Mean Absolute Percentage Error (MAPE) from X24, following Shimshoni et al. (2009). For a single day in the data, MAPE is defined as:

MAPEG,w(d) = Σ_{i=0..23} |X24(i) − X̂(d, i)| / X24(i)

We then use MAPEG,w, the average over all days of the MAPE values for the keyword w in location G. We have experimented with alternative measures for across-day variability, including the Kullback-Leibler (KL) divergence, the coefficient of variation (a normalized version of the standard deviation), time series autocorrelation, and the average day-to-day variation. For lack of space, we do not report on these additional measures here, but note that they can provide alternatives for analysis of across-day variation of diurnal patterns.

Analysis
We used these measures to study the diurnal patterns of the top-keywords dataset described above for the 29 locations in our dataset. In this section we initially address a number of questions regarding the use and interpretation of the patterns: What is the connection between within- and across-day variability, and what keywords demonstrate each? How significantly is the variability influenced by volume, and is there enough volume to reason about diurnal patterns in different locations? How do these patterns capture physical-world versus virtual activities, and can these patterns be used to find similarities between locations? Figure 4 shows scatter plots of keywords w for tweets from locations G = New York, in 4(a), and G = Honolulu, in 4(b).
The words are plotted according to their within-day variability (entropy) on the x-axis, where variation is higher for keywords to the right, and across-day variability (MAPE) on the y-axis, where keywords with higher variability are higher. Pointing out a few examples, the keyword "lunch" shows high within-day variability and relatively low across-day variability in both New York and Honolulu: the keyword demonstrates stable and significant daily patterns. On the other hand, the keyword @justinbieber is mentioned with little variation throughout the day, but with very high across-day variability. Both parts of Figure 4 show a sample area magnified: in 4(a), an area with high across-day variability and low within-day variability, and in 4(b), an area with both low across-day variability and low within-day variability. A number of key observations can be made from Figure 4. First, there are only a few keywords with

[Figure 4: Scatter plots of keywords by within-day variability (entropy, x-axis) and across-day variability (MAPE, y-axis) for (a) New York, NY and (b) Honolulu, HI. Keywords such as "lunch" and "breakfast" appear with high within-day and low across-day variability, while keywords such as "@justinbieber", "snow", "christmas", and "#ff" appear with high across-day variability.]
president middle important por due babe experience extra thru cover youve law plans secret arent breaking yea stories retweet san forever couldnt plus radio wednesday box moving lose tea months parents understand death anymore cream child heading las hopefully bored series design airport bus community wedding matter missing details ride stand computer wit chocolate wonderful concert system tour front rain tips south pass final america jus 2nd blvd college starts finished episode @youtube smile happened hand lil loved program view inside shop network bought dear fat fight gave die games knew west film sucks til latest pro smh wear able hungry sexy slow lucky huge mad wouldnt tickets article definitely idk kid outside jobs bieber btw sound fresh north voice number asked card sun dad los energy weird bro moment fine john hold agree japan gift yup picture http gay wake catch short bill group national company dance cut boys single text share forgot eyes report vegas vote test ones web shot road couple happen tried instead add near thursday ice lots official #nowplaying project takes million children son point order behind date question weeks fans shopping safe justin ppl seeing seems lady living chicken monday website second busy obama liked stupid different sale fire dream works tho either summer american yall pics info police android page sunday shows fall hello means light dead pick public cold dog words listen meeting young reason future walk gone park problem course market trip youll study art sign perfect king 0town media giving close ave fan 100 turn less mine kinda congrats together reading club sick yesterday save send hotel totally daily luck internet fast running small move email link forward missed album rest green wasnt interesting listening drink saying drive bout three ago minutes body till bit ipad power blue rock seriously local star woman bar fucking pay line started following cuz chance starting air eating weather available bitch plan dude beat took 
past break learn site wtf visit wanted ugh traffic easy sometimes health picoffice sad service google excited list review followers season store case told 1st hair idea wants lunch saturday sideeast cont coffee worth boy event forget came app wonder aint apple probably comes half deal update kind called month leave whole water 2011 room dinner tweets others meet street cute lost bring red social theyre posted using taking later heard waiting goes 1! feeling needsokay restaurant photos team yay glad enjoy favorite class wrong seen far talking join sweet center else book tired mom men havent lmao story word least bed part white 2010 playing thinking enough hell heart support without mind super used business special sounds face girls friday change set ask isnt full beach cause hour youtube crazy remember almost black run stuff weekend wont person birthday care believe #fb times iphone hit name movie money amazing left car anything went late city actually found ass True funny hear saw gets hours job song kids online beautiful guy state family women end read baby nothing mean fuck eat everything years stay open high working game place making trying finally hahaha head play sex might though party guess friend since lot omg also anyone buy put says try doesnt talk god early sorry help win maybe tweet must facebook thought year please makes damn hot gotta hard ready top post away music looks phone real hate pretty follow done hey yet blog shit soon many old guys call friends around every photo wanna cool start wish said stop house days food sleep girl watching already wow miss give two coming awesome made looking someone tell sure another little things ever find keep may thing yeah watch thank live fun wait school everyone something bad tomorrow long use week yes happytonight world news nice show morning feel always big getting even could free look gonna via next check never twitter man let wind hope haha say better take feels best video home come life first would way last 
[Figure 4: Within-day patterns (entropy, x-axis) versus across-day patterns (MAPE, y-axis) for keywords in (a) New York, NY and (b) Honolulu, HI.]

Several observations follow from Figure 4. First, a large set of keywords demonstrates significant within-day variability, and most of them demonstrate low variation across days. The spam keyword #jobs shows higher-than-usual inter-day variability among the keywords with high within-day variability. Second, perhaps as expected, the highly stable keywords are usually popular, generic keywords that carry little discriminatory information. Third, patterns in the data can be detected even for a low-volume location such as Honolulu (with 1% of the tweet volume of New York in our dataset). Notice, though, that the set of keywords with high within-day variation (left) in the Honolulu scatter is much noisier, and that the across-day variation in Honolulu is higher for all keywords (the axis scales in Figures 4(a) and 4(b) are identical).

How much information is needed in one area to robustly reason about the stability of patterns? Given the differences in MAPE values shown between the cities in Figure 4, we briefly examine the effect of volume on the stability of patterns. Figure 5 plots the cities in our dataset according to the volume of content for each location (x-axis) and the number of keywords with MAPE scores below a given threshold (y-axis). For example, in Los Angeles, with 150 million tweets, 955 of the top 1,000 keywords have MAPE values lower than 50% (i.e., relatively robust), and 270 of these keywords have MAPE lower than 20%. These values leave some hope for the detection of patterns that are stable and reliable in Los Angeles, i.e., the MAPE outliers. However, as we examine cities with volume lower than 5 million tweets, we see that few keywords have MAPE values lower than 50%. With such low numbers, we may not be able to detect reliable diurnal patterns, even for keywords that are expected to demonstrate them.

[Figure 5: Volume of tweets for a city (x-axis, millions of tweets) versus the number of keywords with MAPE below the 50%, 30%, and 20% thresholds (y-axis).]

We now turn to the questions about the connection between daily patterns and activities in cities. Can patterns capture real or virtual activities in cities? How much do the patterns represent daily urban routines? We first discuss results for three sample keywords. Then, we look at using keywords to find similarities across locations.

Figure 6 shows the daily patterns of three sample keywords as a proportion of the total volume for each hour of the day: “funny” in 6(a), “sleep” in 6(b), and “lunch” in 6(c). For each keyword, we show the daily curves in three cities: New York, Washington DC, and the much smaller Richmond, Virginia. Each curve represents the $X^{24}_{G,w}$ values for the keyword and location as described above, and the error bars represent one standard deviation above and below the mean for each $X^{24}$ series. For example, Figure 6(c) shows that, in terms of proportion of tweets, the “lunch” keyword peaks in all three cities at 12pm. In Washington DC, for example, the peak demonstrates that 0.8% of the tweets posted between 12pm and 1pm include the word “lunch”, versus an average of about 0.05% of the tweets posted between midnight and 1am. The keywords were chosen to demonstrate weak (“funny”) and strong (“lunch”, “sleep”) within-day variation across all locations. Indeed, the average entropy across the three locations is 3.95 for “lunch” and 4.1 for “sleep”, while “funny”, at 4.55, shows lower within-day variability. The across-day variability is less consistent between the locations.
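As a concrete sketch, the two variability measures can be computed from per-day hourly counts roughly as follows. This is an illustration, not the authors' code: the exact MAPE formulation (here, each day's hourly proportions compared against the mean pattern) and the base-2 logarithm are assumptions, though the base is consistent with the values reported above (a perfectly flat 24-hour pattern has entropy log2 24 ≈ 4.58, close to the 4.55 reported for “funny”).

```python
import numpy as np

def hourly_proportions(counts_by_day):
    """counts_by_day: (n_days, 24) array of a keyword's hourly tweet counts.
    Returns the mean proportion of the keyword's daily volume per hour
    (the X^24 series)."""
    counts = np.asarray(counts_by_day, dtype=float)
    daily = counts / counts.sum(axis=1, keepdims=True)  # each day sums to 1
    return daily.mean(axis=0)

def within_day_entropy(x24):
    """Base-2 entropy of the normalized 24-hour distribution.
    Flat patterns approach log2(24) ~= 4.58; peaked patterns score lower."""
    p = np.asarray(x24, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

def across_day_mape(counts_by_day):
    """Mean absolute percentage error of each day's hourly proportions
    against the mean pattern -- one plausible reading of the MAPE measure.
    Hours with zero mean volume should be excluded beforehand."""
    counts = np.asarray(counts_by_day, dtype=float)
    daily = counts / counts.sum(axis=1, keepdims=True)
    mean = daily.mean(axis=0)
    return float(np.mean(np.abs(daily - mean) / mean))
```

Under this reading, a keyword like “lunch” would show low entropy (a sharp noon peak) and, in a high-volume city, a low MAPE across days.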
For example, “sleep”, the most popular of these three keywords in all locations, has a higher MAPE than “funny” in New York, but a lower MAPE in the other two locations. Figure 6(a) shows how the variability of “funny” rises (larger error bars) in the lower-volume cities. The figure demonstrates the mix of virtual and physical activities reflected in Twitter. The keywords “funny” and, for example, “lol” (the most popular keyword in our dataset) exhibit diurnal patterns that are quite robust for high-volume keywords and cities. However, those patterns are less stable and demonstrate high levels of noise when the volume for $X_{G,w}$ is lower. Moreover, the times at which they appear could be more reflective of the activities people perform online than of actual mood or happiness: 6am is perhaps not the ideal time to share funny videos. On the other hand, keywords that seem to represent real-world activities may provide a biased view.

[Figure 6: Daily patterns for three keywords, showing the average proportion of daily volume for each hour of the day. Panels: (a) “funny”, (b) “sleep”, (c) “lunch”; curves shown for New York, Washington, and Richmond.]
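The curves and error bars of Figure 6 are straightforward to reproduce from day-by-day counts: for each day, divide the keyword's hourly tweet counts by the city's total hourly counts, then take the per-hour mean and standard deviation across days. A minimal sketch (array shapes and names are illustrative):

```python
import numpy as np

def diurnal_curve(keyword_counts, total_counts):
    """Per-hour proportion of city volume containing a keyword.
    keyword_counts, total_counts: (n_days, 24) arrays of hourly tweet counts
    for the keyword and for the whole city, respectively.
    Returns (mean, std) across days: the curve and the one-standard-deviation
    error bars of the kind plotted in Figure 6."""
    prop = np.asarray(keyword_counts, dtype=float) / np.asarray(total_counts, dtype=float)
    return prop.mean(axis=0), prop.std(axis=0)
```

For a “lunch”-like keyword, the mean curve would peak near 0.008 (0.8% of tweets) in the noon hour, with error bars growing as the city's tweet volume shrinks.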
While it is feasible that Figure 6(c) reflects actual diurnal lunch patterns of the urban population, Figure 6(b) is not likely to reflect the time people go to sleep, peaking as it does at 3am. Nevertheless, we hypothesize that the diurnal patterns, whether they represent virtual or physical activities, can help us develop an understanding of the similarities and differences in daily routines between cities. Can the similarity between “lunch” and “sleep” diurnal patterns in two cities suggest that the cities are similar? To initially explore this question, we compared the $X^{24}$ series for a set of keywords, for each pair of cities. We measured the similarity of diurnal patterns using the Jensen-Shannon (JS) divergence between the $X^{24}$ series of the same keyword in a pair of cities. JS is defined, analogously to entropy, over two distributions P and Q of equal size (in our case, 24 hours) as follows:

$$JS(P, Q) = \frac{1}{2}\left[\sum_{i=0}^{23} g_p(i) + \sum_{i=0}^{23} g_q(i)\right] \quad (3)$$

where $g_p(i) = p_i \log \frac{p_i}{(p_i + q_i)/2}$ and $g_q(i) = q_i \log \frac{q_i}{(p_i + q_i)/2}$. For our purposes, the distributions $p_i$ and $q_i$ were the normalized hourly mean-volume values for a specific keyword, as in the entropy calculation. The distance between two cities $G_1$, $G_2$ with respect to keyword $w$ is thus $JS(p_{G_1,w}, p_{G_2,w})$. The distance with respect to a set of keywords is the sum over the distances of the words in the set.

For comparison, we selected three sets of ten keywords to use when comparing locations: a random set of keywords, a set of keywords with high entropy (“significant”), and a set of keywords with low MAPE (“stable”). We then plotted the top 10 most similar city pairs, according to each of these sets, on a map, as shown in Figure 7. The thickness of each arc represents the strength of the similarity (scaled linearly and rounded to integer thickness values from 1 to 5).

[Figure 7: The 10 most similar city pairs computed using different sets of keywords: (a) random, (b) stable, (c) significant.]
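Equation 3 translates directly into code. The sketch below uses a base-2 logarithm and represents each city's patterns as a dict mapping keyword to its normalized 24-hour series; both choices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (Equation 3): the averaged KL divergence of
    each distribution to their midpoint, here with a base-2 logarithm."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = (p + q) / 2

    def kl_to_midpoint(a):
        mask = a > 0                  # skip zero bins; m > 0 wherever a > 0
        return float((a[mask] * np.log2(a[mask] / m[mask])).sum())

    return 0.5 * (kl_to_midpoint(p) + kl_to_midpoint(q))

def city_distance(city1, city2, keywords):
    """Distance between two cities with respect to a keyword set: the sum of
    per-keyword JS divergences between their X^24 series. Cities are assumed
    to be dicts mapping keyword -> 24-hour series (an illustrative layout)."""
    return sum(js_divergence(city1[w], city2[w]) for w in keywords)
```

With base 2, the divergence is 0 for identical patterns and 1 for fully disjoint ones, so per-keyword distances are directly comparable before summing.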
The similarities calculated using the 10 random keywords exhibit quite low coherence. While a few city pairs are clearly closely connected (San Francisco-Los Angeles, Boston-Washington), for the most part it is hard to discern regular patterns in the data. Using the top 10 most stable keywords results in a more coherent cluster of major east-coast cities, along with a smaller, weakly connected cluster centered on San Francisco. When we calculate the similarity using the top 10 highest-entropy keywords, the two clusters become much clearer and the locality effect is stronger, resulting in a well-connected west-coast cluster centered around San Francisco, and an east-coast cluster in which Washington D.C. serves as a hub. Despite this promising result, it remains to be seen whether these effects are mostly due to timezone similarity between cities.

Discussion and Conclusions
We have extracted a large and relatively robust dataset of location data from Twitter, and used it to reason about diurnal patterns in different cities. Our method of data collection, based on the users' profile location field, is likely to have resulted in a high-precision dataset with relatively high recall; we estimate that we are at least within an order of magnitude of the content generated in each city. While issues of spam, errors, and noise still exist, they were minimal (and largely ignored in this report). This broad coverage allowed us to investigate the feasibility of a study of keyword-based diurnal patterns in different locations. We showed that patterns can be extracted for keywords, but that the amount of variability is highly dependent on the volume of activity for a city and a keyword. For low-volume cities, most keyword patterns are likely to be too noisy to reason about in any way. On the other hand, keywords with high within-day variability could be detected even for low-volume cities. The mapping from diurnal Twitter activities to physical-world (vs.
virtual) activities and concepts is not yet well defined. The use of keywords like “funny” and “lol”, for example, which can express mood and happiness (Golder and Macy 2011), can also relate to virtual activities that people more readily engage in at different times of day. The reflection of physical activities can also be ambiguous, as shown in the analysis of the keyword “sleep” above.

This exploratory study points to a number of directions for future work. The variability error bars in Figure 6 suggest that the $X_{G,w}$ series might be modelled well using a Poisson arrival model, which would lend itself better to studies and simulations. Topic models and keyword associations might help merge the data for a set of keywords to enhance the volume of data available for analysis of a given topic or theme. This method, or others that could raise the volume of the analyzed time series, would allow more robust analysis, including, for example, the detection of deviations from expected patterns for different topics. Next steps will also include the development of better statistical tools for analyzing the variability and stability of these time series, as well as comparisons against a model or other time series data. We are especially interested in comparing the diurnal patterns in social media to other sources of data, as social media data can augment or replace more costly methods of data collection and analysis.

Acknowledgments
This work was partially supported by National Science Foundation grant CAREER-1054177, a Yahoo! Faculty Engagement Award, and a Google Research Award. The authors thank Jake Hofman for his significant contributions to this work.

References
Ahern, S.; Naaman, M.; Nair, R.; and Yang, J. H.-I. 2007. World Explorer: visualizing aggregate data from unstructured text in geo-referenced collections. In JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, 1–10. New York, NY, USA: ACM.
Cheng, Z.; Caverlee, J.; Lee, K.; and Sui, D. Z. 2011.
Exploring Millions of Footprints in Location Sharing Services. In ICWSM '11: Proceedings of the International AAAI Conference on Weblogs and Social Media. AAAI.
Cheng, Z.; Caverlee, J.; and Lee, K. 2010. You are where you tweet: a content-based approach to geo-locating Twitter users. In CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. New York, NY, USA: ACM Press.
Crandall, D.; Backstrom, L.; Huttenlocher, D.; and Kleinberg, J. 2009. Mapping the World's Photos. In WWW '09: Proceedings of the 18th International Conference on World Wide Web. New York, NY, USA: ACM.
Cuff, D.; Hansen, M.; and Kang, J. 2008. Urban sensing: out of the woods. Commun. ACM 51(3):24–33.
Dodds, P. S.; Harris, K. D.; Kloumann, I. M.; Bliss, C. A.; and Danforth, C. M. 2011. Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12):e26752.
Eisenstein, J.; O'Connor, B.; Smith, N. A.; and Xing, E. P. 2010. A latent variable model for geographic lexical variation. In EMNLP '10: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 1277–1287. Stroudsburg, PA, USA: Association for Computational Linguistics.
Girardin, F.; Fiore, F. D.; Ratti, C.; and Blat, J. 2008. Leveraging explicitly disclosed location information to understand tourist dynamics: a case study. Journal of Location Based Services 2(1):41–56.
Golder, S. A., and Macy, M. W. 2011. Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures. Science 333(6051):1878–1881.
Golder, S. A.; Wilkinson, D.; and Huberman, B. A. 2007. Rhythms of social interaction: messaging within a massive online network. In CT '07: Proceedings of the 3rd International Conference on Communities and Technologies.
Gonzalez, M. C.; Hidalgo, C. A.; and Barabasi, A.-L. 2008. Understanding individual human mobility patterns. Nature 453(7196):779–782.
Hecht, B.; Hong, L.; Suh, B.; and Chi, E. H. 2011. Tweets from Justin Bieber's heart: the dynamics of the location field in user profiles. In CHI '11: Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 237–246. New York, NY, USA: ACM.
Java, A.; Song, X.; Finin, T.; and Tseng, B. 2007. Why we Twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD '07: Proceedings of the WebKDD and SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, 56–65. New York, NY, USA: ACM.
Naaman, M.; Boase, J.; and Lai, C.-H. 2010. Is it really about me? Message content in social awareness streams. In CSCW '10: Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work, 189–192. New York, NY, USA: ACM.
Shimshoni, Y.; Efron, N.; and Matias, Y. 2009. On the Predictability of Search Trends. Technical report, Google. Available from http://research.google.com/archive/google_trends_predictability.pdf.