“Re-make / Re-model”: Should big data change the modelling paradigm in official statistics?
Barteld Braaksma and Kees Zeelenberg

Lay-out of presentation
– Sources and modes of inference
– Big data examples at Statistics Netherlands
– How to use big data?
  ‐ ‘as is’
  ‐ models
– But how about quality?
– More examples
– Conclusions

Sources for official statistics
Always start from observations
– Traditional surveys
  • Statistical populations
  • Owned by statistical offices (full control)
  • Costly and burdensome
– Administrative sources
  • Administrative populations
  • Owned by government bodies (limited control)
  • Cheaper to obtain
– Big (‘organic’) data
  • Unclear populations
  • Owned by private companies (no control)
  • Cost unclear

Modes of inference in official statistics
Main approaches for collecting and processing data
– Design-based
  ‐ Stratified sample survey of sales
– Model-assisted
  ‐ Combine tax data with sales survey (regression)
– Model-based
  ‐ Add up all sales from tax declarations
  ‐ (small-area estimates)
  ‐ (seasonal adjustment)
  ‐ (…)
– Sometimes ‘implicit models’
  ‐ Imputation of missing values
  ‐ Preliminary estimates of GDP

Big data at Statistics Netherlands
Experiments discussed today
– Traffic detection loops
– Social media messages
– Mobile phone data
Other examples, not discussed here
– Scanner data (in production)
– Satellite images
– Financial transactions
– Internet robots (close to production)
– Google Trends
– PM: Administrative data (in production)

Traffic detection loops: daily pattern

Daytime population based on mobile phone data

Big data ‘as is’
– Imperfect, yet timely, indicator of trends
– “These data exist and that’s why they are interesting”
– Example: social media messages
  ‐ Signals of human activity and feelings

Dutch social media activity, 2010-2012

What are people talking about on Twitter?

Sentiment indicator using social media

Big data and statistics
Important issues:
– Undercoverage
– Selectivity
– Volatility
– Interpretation
– Continuity
Traditionalists’ view:
– These sources are useless for producing quality statistics
Modernists’ view:
– We should stop doing surveys, everything is already out there
Déjà-vu:
– Similar discussions when introducing administrative data…

How to use big data?
– Many methodological issues
– No linking variables (often)
– Additional information may be available
– Possible approach: combine available information
  ‐ By old or new mathematical methods (often Bayesian)
  ‐ By integration techniques (“National accounts”-style)
– But how about models?

Examples of models in official statistics
– Correction by weighting for non-response
– Imputation for item non-response
– Seasonal adjustment
– Estimates for small areas
– Capture-recapture models for hard-to-observe populations
– Preliminary (flash) estimates of GDP
– So we are already using models in official statistics!
– But we should look carefully at principles and conditions

Guiding principles of official statistics
European Statistical System, mission statement:
– “We provide the European Union, the world and the public with independent high quality information on the economy and society on European, national and regional levels and make the information available to everyone for decision-making purposes, research and debate.”
ESS Code of Practice, principle 6:
– “Statistical authorities develop, produce and disseminate European Statistics respecting scientific independence and in an objective, professional and transparent manner in which all users are treated equitably.”
ESS Code of Practice, principle 7:
– “Sound methodology underpins quality statistics. This requires adequate tools, procedures and expertise.”
ESS Code of Practice, principle 12:
– “European Statistics accurately and reliably portray reality.”

So how about quality?
For use of models this implies:
– Objectivity:
  ‐ Do not move too far from observed data
  ‐ Objects and populations for the model correspond to the statistical phenomenon
  ‐ No forecasting
– Reliability:
  ‐ Extensive specification to guarantee robustness against model failure
  ‐ No behavioural models

Some model-based examples
– Relation assumed between observations and phenomena
– Sophisticated modelling
– Trial and error
– Signal and noise

Bayesian recursive filter (single traffic loop)

EMD-filtered monthly rush hour indicator and expected manufacturing development

Google Trends for nowcasting (Choi & Varian using a Bayesian regression method)

Mobile phone data vs. traffic loops: opportunities for integration?

Conclusions
– Big data leads to new opportunities
  ‐ Better accuracy and more details
  ‐ More frequent and more timely estimates
  ‐ Statistics in new areas
– Big data based statistics are useful in their own right
– Don’t be afraid to use models
  ‐ Documented and transparent
  ‐ Well tested
  ‐ Describe, do not judge
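
Illustration: design-based vs. model-assisted
As a small illustration of the "Modes of inference" slide, the sketch below contrasts a design-based estimate of total sales with a model-assisted one that brings in tax data through a regression. It is a minimal sketch, not the method used at Statistics Netherlands: the toy numbers, the function names and the assumption of a simple random sample (rather than the stratified design mentioned in the slides) are all hypothetical.

```python
from statistics import mean


def expansion_estimate(y_sample, pop_size):
    """Design-based: scale the sample mean up to the whole population."""
    return pop_size * mean(y_sample)


def regression_estimate(y_sample, x_sample, x_total, pop_size):
    """Model-assisted: adjust the expansion estimate with a register total.

    A least-squares slope of survey sales (y) on tax turnover (x) is fitted
    on the sample; the estimate is then shifted by the gap between the known
    register total and its sample-based counterpart.
    """
    x_bar, y_bar = mean(x_sample), mean(y_sample)
    sxx = sum((x - x_bar) ** 2 for x in x_sample)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_sample, y_sample))
    slope = sxy / sxx
    return pop_size * y_bar + slope * (x_total - pop_size * x_bar)


# Hypothetical toy data: tax turnover is known for all 8 firms,
# survey sales only for the 4 sampled firms (which happen to be small ones).
tax_turnover = [10, 20, 30, 40, 50, 60, 70, 80]
sampled = [0, 1, 2, 3]
survey_sales = [12, 22, 32, 42]

x_sample = [tax_turnover[i] for i in sampled]
design_based = expansion_estimate(survey_sales, len(tax_turnover))
model_assisted = regression_estimate(
    survey_sales, x_sample, sum(tax_turnover), len(tax_turnover)
)

print(design_based)    # 216
print(model_assisted)  # 376.0
```

In this toy case the sample under-represents large firms, so the design-based figure is low. The regression step corrects for that using the register total, but the observed survey data remain the starting point, in line with "do not move too far from observed data".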