Big Data, Official Statistics and Social Science Research: Emerging Data Challenges
Professor Paul Cheung
Director, United Nations Statistics Division

Building the Global Information System
• Elements of a Global Information System: Common Standard, Data Exchange Protocol, Quality Assurance Mechanism, Universal Dissemination Platform, Global Governance Arrangement;
• Working with National Statistical Offices to evolve a global statistical system -- Many achievements over 65 years;
• Now working with National Geospatial Information Authorities to evolve a global geospatial information platform with common practices and standards;
• Imperative to bring these two communities, and other data communities, together to advance an integrated system.

Big Data: A BIG Deal?
[Figure: Google search trend for "big data" and "official statistics", 2004 to 2012]
Source: Google Trends (as of 18 December 2012)

What is Big Data?
No fixed definition, still debated
• Unstructured, Unregulated
• Four Vs:
  Volume: from Terabyte to Geopbyte
  Velocity: high speed of data in and out
  Variety: different formats, integration difficult
  Variability: data flows highly inconsistent
• Complexity: requires data cleansing, linking, and matching the data across systems

Multiple Sources of Data
• Social Everything!
  Networking
  Commenting
• Internet uses
  Online searches
  Online page-view
• Administrative
  Hospital visits
  Sales receipts
  Traffic monitoring
• Commercial
  Cell phone usages
  Credit card transactions
  Insurance records
  Product searches
• Health information
  Electronic medical records
  Medical monitoring
• Satellite imagery
• Monitoring systems

Google: Predicting the Present
Source: Predicting the Present with Google Trends, Choi & Varian, April 2009

Hedonometrics and Twitter
Source: Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter, Dodds et
al., 2011

National Mood (UK) and Twitter
[Figure: Normalized mood scores for JOY, SADNESS, ANGER and FEAR, 16/11 to 04/12]
Source: Mood of Nation [Beta] (http://geopatterns.enm.bris.ac.uk/mood/)

Over 1,000,000 outpatient visits per year by MHC Asia
Source: http://www.mhcasia.com/managedcare/
A. ONE THOUSAND CLINICS in Singapore
B. Adopted by 90% of insurers in Singapore
C. Linked by Web & Smartphone Apps
D. Smartphone Apps – Virtual membership card & clinic locator
1. Reports - Diagnosis, Financial & Statistical Data
2. Disease pattern & management
3. Infectious Disease Alert
4. Cost Control
5. Drug usage data lead to bulk purchase
6. Sick Leave control
7. Audit & Fraud detection
8. Email Alerts (High Claim, Sick Leave Alert)

Electronic Road Pricing (Singapore)
Source: Interactive map ERP, http://www.onemotoring.com.sg

Big Data: Everywhere, Anywhere
• The amount of data grows rapidly (approximately 2.5 quintillion bytes created per day)
• Everything will be, in some sense, a geospatial beacon, referencing or generating location information
• A hyper-connected environment: estimates suggest over 50 billion things connected by 2020.

Real-time Tracking of Population Movement
[Figure: population movement on a regular day vs. the July 4 Macy's fireworks (hypothetical data)]

Big Data – Are they Really Useful?
• A lot of hype, but used mainly in commercial and security applications
• Research and development work is ongoing, with great potential
• Commercial applications developing the fastest
  Detecting fraud / Risk
  Generating consumer profile
  Reducing medical care cost
  Changing travelling and consumption patterns

New Data, New Methods
• Data deluge makes scientific methods obsolete??
• Official statistics depend on classical statistical methods??
• Are social science data models and methods obsolete??
Big Data vs Official Statistics
Official Statistics are Structured Data with Unique Identity
  Population Characteristics: Population Census, Census Questionnaire, Statistical Analysis
  Company Profits/Losses: Survey of Companies, Company Balance Sheet, Statistical Analysis

Big Data and Social Sciences Research
Statistical vs Structural Inference

Incorporating Big Data in Official Statistics
• Could Big Data replace traditional data sources?
  Not a reliable source at this moment
  Limitations (non-representativeness, unreliability)
  Important as corroborating evidence
  Huge potential: faster, cheaper data
  New data sources could replace traditional sources?
  Data-mining with multiple sources of data for new insights

Improving Data Sources in Official Statistics
• A lot of work has been done in official statistics: Common Standard, Data Exchange Protocol, Quality Assurance Mechanism, Universal Dissemination Platform
• New emphasis in Data Sources
  Multi-mode data collection
  Internet based surveys
  Administrative sources
• Too much emphasis on surveys and traditional approaches
• Imperative to review whether Big Data is fit for the purposes of official statistics.

University of Michigan Consumer Sentiment Index: Google Prediction
[Figure panels: Consumer Sentiment Index; Current Economic Condition Index; Consumer Expectations Index]
Source: Consumer Sentiment with Google Trends, Choi, Google Inc., Conference on Empirical Macroeconomics Using Geographical Data, March 2011

Predicting Consumer Sentiment Index
[Figure: Consumer Sentiment Index vs. Google Search - Starbucks Franchise, Jan-04 to May-12]

Google Trend and Unemployment Rate
Source: Consumer Sentiment with Google Trends, Choi, Google Inc.
Conference on Empirical Macroeconomics Using Geographical Data, March 2011

Predicting Insurance Claims
[Figure: Initial claim of Unemployment Insurance vs. Google search "unemployment+social security+welfare", 2004 to 2012]

The Billion Prices Project @ MIT
• Pricing Behavior: What drives price stickiness around the world? How much can be explained by current inflation, and inflation histories? How much by competition and industries' structure?
• Daily Inflation and Asset Prices: Construct daily inflation indexes across countries and sectors and study their ability to match official statistics.
• Pass-Through: How much do prices adjust internally when the exchange rate or the international price of commodities changes?
• Markups: What premium is paid in stores for "green" or "organic" products? With data from multinational retailers, compute premium differences (for exactly the same items) in different places.
The Billion Prices Project @ MIT, http://bpp.mit.edu/

Argentina Aggregate Inflation Series
Source: www.pricestats.com/arindex.html

Mobile Phone Positioning Data for Tourism Statistics
Source: Mobile Telephones and Mobile Positioning data as source for statistics: Estonian Experiences, Ahas et al. (2011)

Intuit Small Business Employment Indexes
Source: http://index.intuit.com/

Big Data as Data Source for Research
Traditional Data on Social Network
  Snow-ball approach, from person to person, rich information on inter-personal relations
Source: Reality Mining, http://reality.media.mit.edu/soc.php
Big Data on Social Network
  Large number of people and connections

Real-time Community Crime Data
Source: https://www.crimereports.com/

Big Data and Representativeness
• What is the population? Who generates the data?
• Can we draw a sample and infer population traits?
• Patterns may reflect what is happening but the 'reference population' is not clear
• Inferential Statistics not possible; hence the use of non-parametric analytics

Big Data: Who Generates the Data? Representative?
[Figure: Demographics of Twitter Users]
Source: The State of Twitter 2012 [STATS], 3 August 2012

Big Data and Social Reality
• Does Big Data reflect social reality?
  Do the data reveal random or real patterns?
  Are the data representative?
  What is the real meaning of the data?
  Do the data reflect social patterns or structures?
• An example: Social network study
  Articulated social networks – list of friends on Facebook
  Behavioural network – communication patterns and cell coordinates

Big Data and Verifiability
• Can the data be verified and re-tested?
• Many big data sets are considered "private" and are not available to the larger academic community for repeated analysis
• Equal data access needed for
  Making scientific replication studies
  Preventing fraudulent publications

Big Data and Confidentiality
• Confidentiality is a big issue. Traditional anonymization might not work well
• Geocoding statistical information creates new concerns
• Sharing continuous time and cell phone location information from a city is a problem
• Google privacy policy update (1 March 2012): linking a person via multiple Google products – collecting across platforms information on health, political opinions and financial concerns
• Demand for precise, location-based information pushes the boundary of confidentiality

"New types of research data about human behavior and society pose many opportunities if crucial infrastructural challenges are tackled."
G King, Science 2011;331:719-721

Using Big Data in Social Science
New Tools and Procedures required for:
• Data preparation/cleaning
• Data reduction
• Data mining
  Searching for patterns and/or relationships
  Building the "best" model
  Applying the "best" model to a new dataset to classify or estimate (machine learning)
  How/what to teach the machine?
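
The data mining steps listed above (building the "best" model, then applying it to a new dataset) can be sketched in a few lines. This is a minimal illustration only, assuming a hold-out validation set is used to decide which model is "best"; the toy numbers, the two candidate models and all function names are hypothetical and do not come from the presentation.

```python
from statistics import mean


def fit_constant(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mu = mean(ys)
    return lambda x: mu


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, using the closed form."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on (xs, ys)."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def select_best(candidates, train, valid):
    """Fit every candidate on train; keep the one with the lowest validation error."""
    fitted = {name: fit(*train) for name, fit in candidates.items()}
    best = min(fitted, key=lambda name: mse(fitted[name], *valid))
    return best, fitted[best]


# Hypothetical toy data: y is roughly 2*x + 1 with small deviations.
train = ([0, 1, 2, 3, 4, 5], [1.1, 2.9, 5.2, 6.8, 9.1, 11.0])
valid = ([6, 7, 8], [13.2, 14.9, 17.1])
new_xs = [9, 10]  # the "new dataset" the chosen model is applied to

candidates = {"constant": fit_constant, "line": fit_line}
best_name, best_model = select_best(candidates, train, valid)
predictions = [best_model(x) for x in new_xs]
print(best_name, predictions)
```

Judging the candidates on data they were not fitted to is one common way to give "best" a concrete meaning; other criteria (cross-validation, information criteria) would fit the same outline.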
Big Data and Computational Challenge
Computational challenge
• Generating manageable structured data from unstructured data
• Integrating big data processing with statistical analysis tools

Learning to Use Big Data
Training required
• Nonstandard data types
• Computational methods
• Protection of data confidentiality
• Legal protocols
• Data sharing norms
• Statistical tools

The Way Forward
• Big Data will become more prominent in years to come.
• Statisticians and Social Scientists should take advantage of new data sources.
• Computational and quantitative analytical skills become important.
• Data must generate insights and knowledge: This is the ultimate goal.
• We must decipher truth vs falsehood.
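
As a small illustration of the computational challenge named earlier (generating manageable structured data from unstructured data), the sketch below turns free-text lines into records that a statistical tool could tabulate. The sample messages, the timestamp format and the keyword list are all assumptions made for illustration; real sources would need far more robust parsing.

```python
import re
from collections import Counter

# Hypothetical free-text records; the format is an assumption for illustration.
RAW = [
    "2012-11-16 09:14 clinic visit, fever and cough",
    "traffic jam near gantry 12 at 2012-11-16 18:40",
    "no timestamp here, just a comment",
    "2012-12-04 07:55 cough, sore throat",
]

TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2})")
KEYWORDS = ("fever", "cough", "traffic")


def to_record(text):
    """Return a dict with date, time and keyword flags, or None if no timestamp is found."""
    match = TIMESTAMP.search(text)
    if match is None:
        return None
    record = {"date": match.group(1), "time": match.group(2)}
    lowered = text.lower()
    for word in KEYWORDS:
        record[word] = word in lowered
    return record


# Keep only the lines that could be structured; the rest would need other handling.
records = [r for r in (to_record(t) for t in RAW) if r is not None]

# Once structured, the data can be tabulated like any other dataset.
cough_by_date = Counter(r["date"] for r in records if r["cough"])
print(records)
print(cough_by_date)
```

Lines that cannot be parsed are dropped here for brevity; in practice the share of discarded records is itself worth reporting, since it bears on the representativeness questions raised earlier.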