Geosocial Big Data Analysis Using Python and FOSS4G, with a Case Study of Korean Data
Ilyoung Hong
Namseoul Univ., Dept. of GIS Engineering

Geosocial data
• Social media (Twitter, Facebook) is the killer app for smartphones
• Smartphones with GPS generate large amounts of geotagged social data
• Geotagged social data is called geosocial data
• Examples: GeoTweets (geotagged tweets), 4sq (Foursquare) venues

Geosocial data research
• Fujita, Hideyuki. "Geo-tagged Twitter collection and visualization system." Cartography and Geographic Information Science 40.3 (2013): 183-191.
  => Computational method, data collection
• Jung, Jin-Kyu. "Code clouds: Qualitative geovisualization of geotweets." The Canadian Geographer/Le Géographe canadien 59.1 (2015): 52-68.
  => Qualitative approach with content analysis
• Li, Linna, Michael F. Goodchild, and Bo Xu. "Spatial, temporal, and socioeconomic patterns in the use of Twitter and Flickr." Cartography and Geographic Information Science 40.2 (2013): 61-77.
  => Spatial statistical analysis with geodemographic data
• Mitchell, Lewis, et al. "The geography of happiness: Connecting twitter sentiment and expression, demographics, and objective characteristics of place." PLoS ONE 8.5 (2013): e64417.
  => Sentiment analysis, computational linguistics approach

Multidisciplinary aspects of geosocial data analysis
[Diagram: geosocial data and its related fields and tasks]
• Fields: Statistics, Linguistics, Text Mining, Sociology, Journalism, Media
• Tasks: Data Collection, Web Programming, Data Management, Data Analysis, Database Management, Qualitative Analysis, Quantitative Analysis, Data Visualization
• Geography, Cartography, GIS

Challenges of geosocial research
• Different data sources and formats: Twitter, Foursquare, Facebook
• Different analysis environments and software: Java, PHP, Python, C, R, ArcGIS, web programming, database programming, statistics, geovisualization
• Different domain knowledge and multidisciplinary research methods: computation, geography, sociology, psychology, statistics, linguistics, media, journalism
• Interdisciplinary cooperation is needed. Is there a way to integrate these methods?

Why Python/FOSS4G for geosocial big data?
• Integrated analysis environment of software and libraries
• Python is free and open
• Object-oriented programming (OOP) in Python
• WinPython, Anaconda (SciPy, IPython), Enthought Canopy for Python 2.7
• A large number of libraries supporting different domain knowledge
• PyPI, the Python Package Index: currently 66,086 packages
• Simple coding environment
• Quick to learn and to code
• Readability: the syntax of Python is readable and clear
Research purpose
• Introduce an integrated platform for analyzing geosocial data using Python & FOSS4G
  • Data collection and management
  • Data analysis with qualitative and quantitative methods
  • Sentiment analysis
  • Geovisualization
• Present a case study with Korean geosocial data
  • GeoTweet distribution
  • Spatial patterns of Foursquare venues
  • Sentiment analysis of Korean GeoTweets

Architecture, at the beginning
[Diagram labels: Social Media, Twitter/Foursquare API, JSON, Excel, CSV, Shapefile, ArcGIS]

Data collection
• Python with the Twitter Streaming API, using tweepy
• Rates are limited per user: Twitter API method calls are limited to 350 calls per hour for one authorized developer account
• Switch to another user ID when the limit is reached
• Unnecessary data must be filtered out
• Geotweets are just 1% of all tweets

Columns from a tweet
● Tweet text => qualitative approach, text mining, keyword filter, sentiment analysis
● Tweet ID; user ID; destination user ID (only for tweets with "@user ID"); user profile (including the location name entered by the user) => behavioral features, heavy-user features, social network
● Location coordinates (only for tweets tagged with location coordinates) => geovisualization, spatial analysis using GIS
● Date and time => temporal analysis

Two studies so far
• Spatial Analysis of Location-Based Social Networks in Seoul, Korea. Journal of Geographic Information System, 2015, 7, 259-265
• Spatial Distribution of Korean Geotweets. Journal of the Korean Cartographic Association, 2015, 15(2), 93-101

Spatial Analysis of Location-Based Social Networks in Seoul
The purpose of this study is to analyze the spatial patterns of location-based social network (LBSN) data in Seoul using the spatial analysis techniques of geographic information systems (GIS).
The study explores the applications of LBSN data by analyzing the association between Seoul's Foursquare venue data, created through user participation, and the city's characteristics. The Foursquare venue data were compiled with a program we created based on Foursquare's Python API. The compiled information was converted into GIS data, which in turn was depicted as a heat map. Cluster analysis was then performed based on hotspots, and the correlation with census variables was analyzed for each administrative unit using geographically weighted regression (GWR). Based on the analytical results, we were able to identify venue clusters around city centers, as well as differences in hotspots for various venue categories and correlations with census variables.
• About 230,000 venues were collected for analysis between March 15 and 21, 2015

Spatial Distribution of Korean Geotweets
In this study, we analyzed the distribution of Korean geotweets collected in November 2014 through the Twitter Streaming API. The collected data were analyzed and converted to GIS data using Python. Twitter use is concentrated in Seoul and the metropolitan areas, and a few heavy users created a large number of tweets. Time series analysis showed that tweet volume peaks on weekends and, within the day, at 14:00. In addition, text analysis identified a high percentage of retweets and regional differences in content.
Key words: Twitter API, Geotweet, Spatial distribution
• Over 2 million tweets were collected in November 2014
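The weekday and hour-of-day peaks described above can be tallied with a short script. The following is a minimal sketch, not the code used in the study. It assumes each tweet is a dict carrying the `created_at` string in the Twitter API v1.1 format and that times should be read in Korea Standard Time (UTC+9). The function names and the three sample records are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are interpreted in Korea Standard Time (UTC+9).
KST = timezone(timedelta(hours=9))
# Format of the "created_at" field in Twitter API v1.1 tweet objects.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def local_time(created_at):
    """Parse a created_at string and convert it to KST."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT).astimezone(KST)


def temporal_profile(tweets):
    """Count tweets per local hour of day and per local weekday."""
    by_hour = Counter()
    by_weekday = Counter()
    for tweet in tweets:
        t = local_time(tweet["created_at"])
        by_hour[t.hour] += 1
        by_weekday[WEEKDAYS[t.weekday()]] += 1
    return by_hour, by_weekday


# Hypothetical sample records, not data from the study.
sample_tweets = [
    {"created_at": "Sat Nov 01 05:10:00 +0000 2014"},  # 14:10 KST, Saturday
    {"created_at": "Sat Nov 01 05:45:00 +0000 2014"},  # 14:45 KST, Saturday
    {"created_at": "Sun Nov 02 16:30:00 +0000 2014"},  # 01:30 KST, Monday
]

by_hour, by_weekday = temporal_profile(sample_tweets)
print(by_hour.most_common(1))
print(by_weekday.most_common())
```

Converting to local time before counting matters here: a tweet sent late on Sunday in UTC falls on Monday in Korea, so counting in UTC would shift both the weekday and the hourly profile.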
[Figure: Distribution of geotweets, Nov 2014]
[Figure: Spatial distribution of geotweets]
[Figure: Daily distribution of geotweets, Nov 2014]

Text analysis
• High percentage of retweets
• Some keywords represent regional features
• pytagcloud, word_cloud

Problems
• Exploratory statistical analysis involves repeated work, but the process is not automated
• It takes time and introduces data errors
• As time goes by, the data become too big to handle
• Data need to be managed in a database, not as text files
• Data and software should be compatible in the same environment for automated analysis

Python & FOSS4G
• Integrated analysis environment
• A large number of libraries supporting different domain knowledge
• Automated analysis scripts can be created

Architecture
[Diagram: components and tools]
• Social media server: Twitter API (Tweepy)
• Data collection, data parsing: pyspatialite
• GIS data server: data conversion, Spatialite
• Visualization client: geovisualization (Quantum GIS), word cloud (pytagcloud), pyspatialite
• Analysis client: shape/text data; sentiment analysis (Python NLTK); statistical analysis (PySAL); pandas for data analysis

Analysis process
[Flowchart labels: Social media data; Geotagged?; GIS database; Data type? (text mining, quantitative); Analysis method? (sentiment analysis, spatial analysis, statistical analysis, word clouds); Visualizing method? (heat map, thematic mapping, hotspot, GWR)]

Why a Spatialite database?
- Standalone, file-based database: easy to handle
- Compatible and interoperable: Python, QGIS, ArcGIS; export/import to any format
- Easy to use, with a GUI
- pyspatialite

Sentiment analysis with Python NLTK text classification
• Sentiment analysis using NLTK
• Tweet text => POS, NEU, NEG values

Heat map using Quantum GIS
• Geotweets, July 2015
• Hottest, most positive places: Jongro, HongDae, Youngsan

Word cloud
[Figure: word clouds for HongDae, Jongro, Youngsan]

Best positive tweets
• Happy Pride from Kat! #seoul #gaypride #kqcf2015 #korea #hugagaytoday @ Seoul City Hall Korea https://t.co/81TiNdqCMH
• #seoulgayprideparade HAPPY PRIDE DAY KOREA!!!! #rainbow #lgbt #love #happy #seoul #korea @ Seoul Plaza https://t.co/FUCkHxmIsc
• Good times and more Korean BBQ with the Samsung team #MobLabs #GangnamStyle @ Gangnam, Seoul, Korea https://t.co/NyIa440NZ3
• Happy Sunday :) @ Myeongdong Cathedral https://t.co/TezVZTVtDH
• We go by the zoo via the "Elephant Train" to the museum @ Seoul Grand Park Zoo https://t.co/imXCgPrcBG
• Korean food is the best food #korea #food #nofiilter @ Seoul ,Korea https://t.co/MqVDHqqoEy
• Have a beautiful and fruitful week IG fam! #MondayLook #mamichoux @ Hongdae Seoul https://t.co/lVM5NdLJyp
• Happy the 4th of July to all my American friends! (@ Thursday Party in Seoul) https://t.co/CG27beaCQl
• And with Elizaveta from Russia :) @ Trickeye Museum https://t.co/7NCrGUYOF1
• Quick tour of a Korean apartment @ Hongdae Seoul South Korea https://t.co/yTy8mAVCZk
• ...

Conclusion and future work
• Analysis of geosocial data is a complex, multidisciplinary process
• This research presents an integrated architecture using Python & FOSS4G
• Future work
  • Automated processing with Python scripts
  • More work on QGIS and PySAL for more advanced analysis and visualization
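
The automated processing named under future work could start from a script like the one below. It is a minimal sketch under stated assumptions, not the architecture's actual code. The standard-library sqlite3 module stands in for Spatialite, so coordinates are stored as plain numeric columns rather than geometries. A toy word-list scorer stands in for the NLTK classifier, producing the same POS/NEU/NEG labels. The table name, word lists, coordinates, and the "Worst traffic jam today" record are hypothetical.

```python
import sqlite3

# Toy word lists, purely illustrative; they stand in for a trained classifier.
POSITIVE = {"happy", "good", "best", "beautiful", "love"}
NEGATIVE = {"sad", "bad", "worst", "angry", "hate"}


def label_sentiment(text):
    """Return POS, NEG, or NEU from a simple count of listed words."""
    words = [w.strip("#@!.,:;()\"'").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "POS"
    if score < 0:
        return "NEG"
    return "NEU"


def build_database(tweets, path=":memory:"):
    """Store tweets with their sentiment label in one SQLite table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS geotweet ("
        "tweet_id INTEGER PRIMARY KEY, text TEXT, "
        "lon REAL, lat REAL, sentiment TEXT)"
    )
    rows = [
        (t["id"], t["text"], t["lon"], t["lat"], label_sentiment(t["text"]))
        for t in tweets
    ]
    conn.executemany("INSERT OR REPLACE INTO geotweet VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Sample records: texts 1, 2, and 4 are shortened from the slides;
# IDs and coordinates are placeholders, and record 3 is invented.
sample = [
    {"id": 1, "text": "Happy Sunday :) @ Myeongdong Cathedral", "lon": 126.99, "lat": 37.56},
    {"id": 2, "text": "Korean food is the best food #korea #food", "lon": 126.98, "lat": 37.57},
    {"id": 3, "text": "Worst traffic jam today", "lon": 127.03, "lat": 37.50},
    {"id": 4, "text": "Quick tour of a Korean apartment @ Hongdae Seoul", "lon": 126.92, "lat": 37.55},
]

conn = build_database(sample)
counts = dict(conn.execute(
    "SELECT sentiment, COUNT(*) FROM geotweet GROUP BY sentiment"
))
print(counts)
```

Keeping collection output in a single database file rather than in text files addresses the problems listed earlier: the same file can be queried from Python and, with Spatialite geometries added, opened directly in QGIS.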