Big Data, Official Statistics and Social Science Research: Emerging Data Challenges

Professor Paul Cheung
Director, United Nations Statistics Division
Building the Global Information System
• Elements of a Global Information System: Common
Standard, Data Exchange Protocol, Quality Assurance
Mechanism, Universal Dissemination Platform, Global
Governance Arrangement;
• Working with National Statistical Offices to evolve a global
statistical system -- Many achievements over 65 years;
• Now working with National Geospatial Information
Authorities to evolve a global geospatial information
platform with common practices and standards;
• Imperative to bring these two communities, and other data
communities, together to advance an integrated system.
Big Data: A BIG Deal?
Figure: Google search trend for "big data" and "official statistics", 2004 to 2012 (search index, 0 to 100)
Source: Google Trends (as of 18 December 2012)
What is Big Data?
No fixed definition, still debated
• Unstructured, Unregulated
• Four Vs:
Volume: from Terabyte to Geopbyte
Velocity: high speed of data in and out
Variety: different formats, integration difficult
Variability: data flows highly inconsistent
• Complexity: requires data cleansing, linking, and
matching the data across systems
Multiple Sources of Data
• Social Everything!
Networking
Commenting
• Internet uses
Online searches
Online page-view
• Administrative
Hospital visits
Sales receipts
Traffic monitoring
• Commercial
Cell phone usages
Credit card transactions
Insurance records
Product searches
• Health information
Electronic medical records
Medical monitoring
• Satellite imagery
• Monitoring systems
Google: Predicting the Present
Source: Predicting the Present with Google Trends, Choi & Varian, April 2009
Hedonometrics and Twitter
Source: Temporal Patterns of Happiness and Information in a Global Social Network:
Hedonometrics and Twitter, Dodds et al., 2011
National Mood (UK) and Twitter
Figure: Normalized mood scores for JOY, SADNESS, ANGER and FEAR (dates marked on chart: 16/11, 04/12)
Source: Mood of Nation [Beta] (http://geopatterns.enm.bris.ac.uk/mood/)
Over 1,000,000 outpatient visits per year by MHC Asia
Source: http://www.mhcasia.com/managedcare/
A. ONE THOUSAND CLINICS in Singapore
B. Adopted by 90% of insurers in Singapore
C. Linked by Web & Smartphone Apps
D. Smartphone Apps – Virtual membership card & clinic locator
1. Reports - Diagnosis, Financial & Statistical Data
2. Disease pattern & management
3. Infectious Disease Alert
4. Cost Control
5. Drugs usage data lead to bulk purchase
6. Sick Leave control
7. Audit & Frauds detection
8. Email Alerts (High Claim, Sick Leave Alert)
Electronic Road Pricing (Singapore)
Source: Interactive map ERP, http://www.onemotoring.com.sg
Big Data : Everywhere, Anywhere
• The amount of data grows rapidly (approximately
2.5 quintillion bytes created per day)
• Everything will be, in some sense, a geospatial
beacon, referencing or generating location
information
• A hyper-connected environment: estimates suggest
over 50 billion things connected by 2020.
Real-time Tracking of Population Movement
Figure panels (hypothetical data): July 4 Macy’s fireworks; Regular
Big Data – Are they Really Useful?
• A lot of hype, but used mainly in commercial and
security applications
• Research and development work is ongoing,
with great potential
• Commercial applications developing the fastest
Detecting fraud / Risk
Generating consumer profile
Reducing medical care cost
Changing travelling and consumption patterns
New Data, New Methods
• Does the data deluge make scientific methods
obsolete?
• Do official statistics still depend on classical
statistical methods?
• Are social science data models and
methods obsolete?
Big Data vs Official Statistics
Official Statistics are Structured Data with Unique Identity
Population Characteristics -> Population Census -> Census Questionnaire -> Statistical Analysis
Company Profits/Losses -> Survey of Companies -> Company Balance Sheet -> Statistical Analysis
Big Data and Social Sciences Research
Statistical vs Structural Inference
Incorporating Big Data in Official Statistics
• Could Big Data replace traditional data sources?
Not a reliable source at this moment
Limitations (non-representativeness, unreliability)
Important as corroborating evidence
Huge potential: faster, cheaper data
Could new data sources replace traditional
sources?
Data-mining with multiple sources of data for new
insights
Improving Data Sources in Official Statistics
• A lot of work has been done in official statistics: Common
Standard, Data Exchange Protocol, Quality Assurance
Mechanism, Universal Dissemination Platform
• New emphasis in Data Sources
Multi-mode data collection
Internet based surveys
Administrative sources
• Too much emphasis on surveys and traditional approaches
• Imperative to review Big Data sources and assess whether
they are fit for the purposes of official statistics.
University of Michigan Consumer Sentiment Index:
Google Prediction
Consumer Sentiment Index
Current Economic Condition Index
Consumer Expectations Index
Source: Consumer Sentiment with Google Trends, Choi, Google Inc. Conference on Empirical Macroeconomics Using
Geographical Data, March 2011
Predicting Consumer Sentiment Index
Figure: Consumer Sentiment Index and Google Search - Starbucks Franchise, Jan-04 to May-12 (vertical axis from -2.5 to 2.5)
Google Trend and Unemployment Rate
Source: Consumer Sentiment with Google Trends, Choi, Google Inc. Conference on Empirical Macroeconomics Using
Geographical Data, March 2011
Predicting Insurance Claims
Figure: Initial claim of Unemployment Insurance (axis 0 to 1,200,000) and Google search unemployment+social security+welfare (index 0 to 100), 2004 to 2012
The Billion Prices Project @ MIT
• Pricing Behavior: What drives price stickiness around the world?
How much can be explained by current inflation, and inflation
histories? How much by competition and industries’ structure?
• Daily Inflation and Asset Prices: Construct daily inflation indexes
across countries and sectors and study their ability to match official
statistics.
• Pass-Through: How much do prices adjust internally when the
exchange rate, or the international price of commodities, changes?
• Markups: What premium is paid in stores for “green” or “organic”
products? With data from multinational retailers, compute premium
differences (for exactly the same items) in different places.
The Billion Prices Project @ MIT, http://bpp.mit.edu/
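The idea of building a daily inflation index from online prices can be sketched as follows. This is only an illustration, not the Billion Prices Project's actual methodology: it assumes a chained Jevons index (the geometric mean of day-on-day price relatives over items priced on both days), and the item names and prices are hypothetical.

```python
import math

def daily_jevons_index(prices_by_day, base=100.0):
    """Chain a Jevons index over consecutive days.

    prices_by_day: list of dicts {item_id: price}, one dict per day.
    Only items priced on both consecutive days enter each link.
    """
    index = [base]
    for prev, curr in zip(prices_by_day, prices_by_day[1:]):
        matched = [i for i in curr if i in prev and prev[i] > 0 and curr[i] > 0]
        if not matched:
            index.append(index[-1])  # no matched items: carry the index forward
            continue
        mean_log_rel = sum(math.log(curr[i] / prev[i]) for i in matched) / len(matched)
        index.append(index[-1] * math.exp(mean_log_rel))
    return index

# Hypothetical scraped prices for three consecutive days
days = [
    {"milk": 1.00, "bread": 2.00},
    {"milk": 1.10, "bread": 2.00, "eggs": 3.00},
    {"milk": 1.10, "bread": 2.20, "eggs": 3.00},
]
series = daily_jevons_index(days)
```

A series built this way could then be compared with the official monthly index, which is the kind of check the "ability to match official statistics" bullet describes.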
Argentina Aggregate Inflation Series
Source: www.pricestats.com/arindex.html
Mobile Phone Positioning Data for
Tourism Statistics
Source: Mobile Telephones and Mobile Positioning data as source for statistics:
Estonian Experiences, Ahas et al. (2011)
Intuit Small Business Employment Indexes
Source: http://index.intuit.com/
Big Data as Data Source for Research
Traditional Data on Social Network
Snow-ball approach, from
person to person, rich
information on inter-personal
relations
Source: Reality Mining, http://reality.media.mit.edu/soc.php
Big Data on Social Network
Large number of people and
connections
Real-time Community Crime Data
Source: https://www.crimereports.com/
Big Data and Representativeness
• What is the population? Who generates
the data?
• Can we draw a sample and infer
population traits?
• Patterns may reflect what is happening
but the ‘reference population’ is not clear
• Inferential Statistics not possible; hence
the use of non-parametric analytics
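To make the reference-population problem concrete, the sketch below compares a naive mean from a non-representative source with a post-stratified mean. It assumes the population share of each group is known from an external source such as a census, which is exactly what is often missing for Big Data. The group labels, shares and records are hypothetical.

```python
def naive_mean(records):
    """Unweighted mean of the outcome over all (group, outcome) records."""
    return sum(y for _, y in records) / len(records)

def poststratified_mean(records, population_shares):
    """Weight each group's mean by its (assumed known) population share."""
    sums, counts = {}, {}
    for group, y in records:
        sums[group] = sums.get(group, 0.0) + y
        counts[group] = counts.get(group, 0) + 1
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no observations for groups: {sorted(missing)}")
    return sum(share * sums[g] / counts[g] for g, share in population_shares.items())

# Hypothetical source in which young users are over-represented
records = [("young", 1.0)] * 80 + [("old", 0.0)] * 20
shares = {"young": 0.3, "old": 0.7}

naive = naive_mean(records)                      # 0.8
adjusted = poststratified_mean(records, shares)  # 0.3
```

When the shares are unknown, or when some groups never appear in the data, no such correction is available, and that is the situation the slide describes.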
Big Data: Who Generates the Data?
Representative? Demographics of Twitter Users
Source: The State of Twitter 2012 [STATS], 3 August 2012
Big Data and Social Reality
• Does Big Data reflect social reality?
Do the data reveal random or real patterns?
Are the data representative?
What is the real meaning of the data?
Do the data reflect social patterns or structures?
• An example: Social network study
Articulated social networks – list of friends on
Facebook
Behavioural network – communication patterns and
cell coordinates
Big Data and Verifiability
• Can the data be verified and re-tested?
• Many big data are considered “private”, not
available to larger academic community for
repeated analysis
• Equal data access needed for
Conducting scientific replication studies
Preventing fraudulent publications
Big Data and Confidentiality
• Confidentiality a big issue. Traditional anonymization
might not work well
• Geocoding statistical information creates new
concerns
• Sharing continuous time and cell phone location
information from a city is a problem
• Google privacy policy update (1 March 2012): linking
a person via multiple Google products – collecting
information across platforms on health, political
opinions and financial concerns
• Demands for precise, location-based information
push the boundary of confidentiality
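One way to see why removing names may not be enough for location data is to count how many users are pinned down by a handful of their own (hour, cell) observations. The sketch below is illustrative only: the function name, the data layout and the traces are hypothetical.

```python
from itertools import combinations

def fraction_unique(traces, k):
    """Share of users for whom at least one set of k of their own
    (hour, cell) points is contained in no other user's trace."""
    unique_users = 0
    for user, points in traces.items():
        others = [p for u, p in traces.items() if u != user]
        for combo in combinations(sorted(points), k):
            if not any(set(combo) <= other for other in others):
                unique_users += 1
                break
    return unique_users / len(traces)

# Hypothetical "anonymized" traces: user id -> set of (hour, cell id) points
traces = {
    "u1": {(8, "A"), (12, "B"), (18, "C")},
    "u2": {(8, "A"), (12, "B"), (18, "D")},
    "u3": {(8, "A"), (12, "E"), (18, "C")},
}

share_one_point = fraction_unique(traces, 1)   # two of three users
share_two_points = fraction_unique(traces, 2)  # all three users
```

In this toy data, knowing two points is enough to single out every user, even though no names are stored.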
New types of research data about human behavior and society pose many
opportunities if crucial infrastructural challenges are tackled.
G King Science 2011;331:719-721
Using Big Data in Social Science
New Tools and Procedures required for:
• Data preparation/cleaning
• Data reduction
• Data mining
Searching for patterns and/or relationships
Building the “best” model
Applying the “best” model to a new dataset to
classify or estimate (machine learning)
How/what to teach the machine?
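The build-then-apply cycle above can be sketched in a few lines. The technique here is a stand-in chosen for brevity: a k-nearest-neighbour classifier on one feature, with "best" taken to mean highest accuracy on a held-out set. The data and labels are hypothetical.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, holdout, k):
    """Share of held-out records that the k-NN model classifies correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)

def select_k(train, holdout, candidates):
    """Pick the candidate k with the best holdout accuracy (ties: smallest k)."""
    return max(sorted(candidates), key=lambda k: accuracy(train, holdout, k))

# Hypothetical labelled data; the record at 2.5 carries a noisy label
train = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "high"),
         (8.0, "high"), (8.5, "high"), (9.0, "high")]
holdout = [(1.2, "low"), (2.4, "low"), (8.8, "high")]

best_k = select_k(train, holdout, [1, 3, 5])                 # build the "best" model
new_labels = [knn_predict(train, x, best_k) for x in (2.3, 7.0)]  # apply to new data
```

With k = 1 the noisy record misleads the model on the holdout set, so the selection step prefers k = 3. That is a small instance of the "how/what to teach the machine" question.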
Big Data and Computational Challenge
Computational challenge
• Generating manageable
structured data from
unstructured data
• Integrating big data
processing with statistical
analysis tools
Learning to Use Big Data
Training required
• Nonstandard data types
• Computational methods
• Protection of data confidentiality
• Legal protocols
• Data sharing norms
• Statistical tools
The Way Forward
• Big Data will become more prominent in
years to come.
• Statisticians and Social Scientists should
take advantage of new data sources.
• Computational and quantitative analytical
skills will become important.
• Data must generate insights and
knowledge: This is the ultimate goal.
• We must decipher truth vs falsehood.