Slide - IAOS 2014 Conference

“Re-make / Re-model”:
Should big data change the modelling paradigm in official statistics?
Barteld Braaksma and Kees Zeelenberg
Lay-out of presentation
– Sources and modes of inference
– Big data examples at Statistics Netherlands
– How to use big data?
‐ ‘as is’
‐ models
– But how about quality?
– More examples
– Conclusions
Sources for official statistics
Always start from observations
– Traditional surveys
• Statistical populations
• Owned by statistical offices (full control)
• Costly and burdensome
– Administrative sources
• Administrative populations
• Owned by government bodies (limited control)
• Cheaper to obtain
– Big (‘organic’) data
• Unclear populations
• Owned by private companies (no control)
• Cost unclear
Modes of inference in official statistics
Main approaches for collecting and processing data
– Design-based
‐ Stratified sample survey of sales
– Model-assisted
‐ Combine tax data with sales survey (regression)
– Model-based
‐ Add up all sales from tax declarations
‐ (small-area estimates)
‐ (seasonal adjustment)
‐ (…)
– Sometimes ‘implicit models’
‐ Imputation of missing values
‐ Preliminary estimates of GDP
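To make the first two modes concrete, here is a minimal sketch of a design-based stratified estimate of total sales next to a model-assisted regression estimate that uses a known auxiliary total (for example, turnover from tax data). The function names, the simple-random-sampling assumption and the single auxiliary variable are illustrative assumptions, not the estimators used at Statistics Netherlands.

```python
def stratified_total(strata):
    """Design-based: expand each stratum's sample mean to its population size.

    `strata` is a list of (population_size, sample_values) pairs (hypothetical layout).
    """
    total = 0.0
    for population_size, sample in strata:
        total += population_size * sum(sample) / len(sample)
    return total


def regression_total(sample_x, sample_y, population_size, x_total):
    """Model-assisted: correct the expanded sample mean with a known auxiliary total.

    Assumes a simple random sample and one auxiliary variable x whose
    population total `x_total` is known (e.g. from an administrative source).
    """
    n = len(sample_y)
    mean_x = sum(sample_x) / n
    mean_y = sum(sample_y) / n
    # Least-squares slope of y on x within the sample.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sample_x, sample_y))
    sxx = sum((x - mean_x) ** 2 for x in sample_x)
    slope = sxy / sxx
    # Shift the design-based estimate by the gap between known and estimated x totals.
    return population_size * mean_y + slope * (x_total - population_size * mean_x)
```

The regression estimator still starts from the sample; the model only adjusts it, which is why this mode is called model-assisted rather than model-based.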
Big data at Statistics Netherlands
Experiments discussed today
– Traffic detection loops
– Social media messages
– Mobile phone data
Other examples, not discussed here
– Scanner data (in production)
– Satellite images
– Financial transactions
– Internet robots (close to production)
– Google Trends
– PM: Administrative data (in production)
Traffic detection loops: daily pattern
Daytime population based on mobile phone data
Big data ‘as is’
– Imperfect, yet timely, indicator of trends
– “These data exist and that’s why they are interesting”
– Example: social media messages
‐ Signals of human activity and feelings
Dutch social media activity, 2010-2012
What are people talking about on Twitter?
Sentiment indicator using social media
Big data and statistics
Important issues:
– Undercoverage
– Selectivity
– Volatility
– Interpretation
– Continuity
Traditionalists’ view:
– These sources are useless for producing quality statistics
Modernists’ view:
– We should stop doing surveys, everything is already out there
Déjà-vu:
– Similar discussions when introducing administrative data…
How to use big data?
– Many methodological issues
– No linking variables (often)
– Additional information may be available
– Possible approach: combine available information
‐ By old or new mathematical methods (often Bayesian)
‐ By integration techniques (“National accounts”-style)
– But how about models?
Examples of models in official statistics
– Correction by weighting for non-response
– Imputation for item non-response
– Seasonal adjustment
– Estimates for small areas
– Capture-recapture models for hard-to-observe populations
– Preliminary (flash) estimates of GDP
– So we are already using models in official statistics!
– But we should look carefully at principles and conditions
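As an illustration of how small such a model can be, the capture-recapture idea can be sketched with the classical two-source Lincoln-Petersen estimator. The slides name only the model family; the choice of this particular estimator, the function name and the example figures are assumptions made for illustration.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Estimate population size from two overlapping lists.

    Assumes the two sources are independent and records can be linked
    without error; both are model assumptions that need checking in practice.
    """
    if n_both == 0:
        raise ValueError("no overlap between the sources: estimate is undefined")
    return n_first * n_second / n_both


# Hypothetical figures: 200 persons in source A, 150 in source B, 30 in both.
estimate = lincoln_petersen(200, 150, 30)
```

The estimate depends entirely on the independence assumption, which is exactly the kind of condition the last bullet above asks to examine.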
Guiding principles of official statistics
European Statistical System, mission statement
– “We provide the European Union, the world and the public with independent high
quality information on the economy and society on European, national and
regional levels and make the information available to everyone for decision-making purposes, research and debate.”
ESS Code of Practice, principle 6:
– “Statistical authorities develop, produce and disseminate European Statistics
respecting scientific independence and in an objective, professional and
transparent manner in which all users are treated equitably.”
ESS Code of Practice, principle 7:
– “Sound methodology underpins quality statistics. This requires adequate tools,
procedures and expertise.”
ESS Code of Practice, principle 12:
– “European Statistics accurately and reliably portray reality.”
So how about quality?
For use of models this implies:
– Objectivity:
‐ Do not move too far from observed data
‐ Objects and populations for the model correspond to the statistical phenomenon
‐ No forecasting
– Reliability:
‐ Extensive specification to guarantee robustness against model failure
‐ No behavioural models
Some model-based examples
– Relation assumed between observations and phenomena
– Sophisticated modelling
– Trial and error
– Signal and noise
Bayesian recursive filter (single traffic loop)
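The slide does not say which filter was applied to the loop counts. One common instance of a Bayesian recursive filter is a scalar Kalman filter on a local-level (random walk plus noise) model, sketched below under that assumption; the function name, parameters and variances are hypothetical.

```python
def local_level_filter(observations, process_var, obs_var, prior_mean, prior_var):
    """Recursively update a Gaussian belief about a slowly changing level.

    Assumed model (not taken from the slides):
        level_t = level_{t-1} + process noise (variance process_var)
        y_t     = level_t     + measurement noise (variance obs_var)
    Returns the filtered mean after each observation.
    """
    mean, var = prior_mean, prior_var
    filtered = []
    for y in observations:
        # Predict: the level follows a random walk, so uncertainty grows.
        var = var + process_var
        # Update: the gain weighs the new observation against the prediction.
        gain = var / (var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered.append(mean)
    return filtered
```

A large `obs_var` relative to `process_var` makes the filter trust its running estimate more than each noisy count, which is how signal is separated from noise in this setup.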
EMD-filtered monthly rush hour indicator and expected manufacturing development
18
Google Trends for nowcasting
(Choi & Varian using a Bayesian regression method)
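The slide names only "a Bayesian regression method" without details, so the following is not a reproduction of it. It is a minimal sketch of the general idea: regress a target series on a search-volume index with a normal prior on the slope, then nowcast from the latest index value. The no-intercept model, the known noise variance and all names are simplifying assumptions.

```python
def posterior_slope(x, y, noise_var, prior_mean, prior_var):
    """Conjugate normal update for the slope in y = slope * x + noise.

    Assumes known noise variance and a normal prior on the slope.
    Returns the posterior mean and variance of the slope.
    """
    precision = sum(xi * xi for xi in x) / noise_var + 1.0 / prior_var
    weighted = sum(xi * yi for xi, yi in zip(x, y)) / noise_var + prior_mean / prior_var
    return weighted / precision, 1.0 / precision


def nowcast(x_new, slope_mean):
    """Point nowcast for the current period from the latest index value."""
    return slope_mean * x_new
```

The prior pulls the slope towards `prior_mean`, which limits how far a short or noisy search series can move the estimate away from what was assumed beforehand.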
Mobile phone data vs. traffic loops: opportunities for integration?
Conclusions
– Big data leads to new opportunities
‐ Better accuracy and more details
‐ More frequent and more timely estimates
‐ Statistics in new areas
– Big data based statistics are useful in their own right
– Don’t be afraid to use models
‐ Documented and transparent
‐ Well tested
‐ Describe, do not judge