Statistical Matching in the framework of the modernization of social statistics
Aura Leulescu & Emilio Di Meglio
EUROSTAT Unit F3 - Living conditions and social protection statistics

Key priorities in the EU context
– To respond to cross-cutting and complex user needs by providing broad indicators on economic well-being and Quality of Life (Stiglitz Report, Europe 2020, GDP and beyond communication, OECD initiative on measuring well-being, etc.);
– Demand for a comprehensive and coherent system of socio-economic statistics to go beyond aggregates and capture heterogeneity in the population: multivariate distributions, sub-national statistics, vulnerable sub-groups;
– Demand for micro-level statistical information that encompasses both social and economic aspects

Premises
– No single survey can provide all the necessary information
– No common identifiers allow record linkage at EU level
– Need for micro (meso)-level integrated statistical information from a coordinated network of surveys and data collection processes at EU level

Statistical matching?
High potential benefits:
– Increased and better use of existing data at minimum costs,
– Enhanced conceptual and statistical consistency across surveys,
– Development of in-house expertise in the domains of data matching transferable to other projects.
But also high risks:
– Inherent limitations of statistical matching techniques and model-based imputation;
– Need to consider both micro-level data matching and meso-level data matching (small sub-populations could also be matched).

Matching project: 1) Scope
This project should:
– carry out methodological work, identify and test statistical matching algorithms based on the “fitness for purpose” principle;
– identify suitable criteria for assessing validity of findings based on both input quality and the robustness of the matching methods proposed;
– produce methodological guidelines and recommendations for further implementation in Eurostat and/or MSs.

Matching project: 2) Investigation streams
The project should assess the quality of the results and the relevance of the approach to cover specific needs:
– Material well-being estimates based on wealth, consumption and income (matching of HFCS, HBS and SILC);
– Quality of Life indicators that go beyond monetary resources (matching of SILC with LFS and EHIS and outside sources, such as ESS and EQLS);
– Poverty estimates at regional level, linked to the monitoring of Europe 2020 (matching of data from SILC, EHIS and LFS).

Matching project: 3) Timeline
I phase: some preliminary analysis focused especially on setting the boundaries for the project
– Dec 2010 - July 2011: External contract for matching EU-SILC, ESS and EQLS
– Dec 2010 - April 2011: In-house matching exercise (review of the state of the art & preliminary analysis focused on the reconciliation of datasets)
II phase
– May 2011 - Dec 2012: Follow-up of the in-house exercise
– May 2011: Launch of the call for tender (according to preliminary results of the three investigation streams)
– November 2011: Signature of contract(s)
– December 2012: Recommendations for implementation

Matching project: 4) Organizational aspects
The project is expected:
– to draw on both external contracts and the development of in-house expertise on matching techniques;
– to involve various stakeholders: concerned units in Eurostat, ECB, Eurofound, Commission users (DG EMPL, DG SANCO, DG REGIO) and academic experts;
– to develop synergies with ESS initiatives:
  • Core social variables
  • ESSnet on Data Integration
  • ESSnet on Small Area Estimation

Matching exercise: ex-ante reconciliation 1
Main purpose: identify specific realistic objectives
Identify target variables
a) Income, consumption and wealth
– HFCS: value of assets and liabilities;
– EU-SILC: material deprivation, detailed income;
– HBS: food expenditure, leisure goods and services, transport expenditure;
b) Quality of life indicators
– EQLS/ESS: social capital, quality of society, satisfaction variables
– LFS: job quality, training...
– SILC: standards of living
c) Regional estimates
– Impute household disposable equivalized income in LFS

Matching exercise: ex-ante reconciliation 2
Select matching/stratification variables
– Predictive power (econometric models, correlations, multivariate analysis)
– Data quality
– Consistency of concepts and statistical content
Deal with different weights from the various surveys
Define the observation level
– Individual
– Household
– Sub-population
What type of auxiliary information can we use to validate results?
– overlap samples (NL);
– (partial) overlap variables (income classes in EQLS; some material deprivation; food consumption in HFCS)

Matching exercise: methods and quality assessment - Preliminary ideas
Matching algorithms
– Hot deck techniques, regression-based, multiple imputation?
– Deal with complex survey designs (constraints)
– Create synthetic datasets versus estimate parameters (e.g. estimate frequencies by class of income & wealth);
How to assess quality/validity?
– Checking the marginal and joint distributions of the donor/fused dataset;
– Assess probability of good match (e.g. distribution distances donor-recipient)
Need to assess the sensitivity of the results to changes in assumptions:
– Simulation exercises; auxiliary information; theoretical validation;
Some applications: SPSD Canada (Liu & Kovacevic, 1997), ISTAT (Coli et al., 2006)
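
Illustrative sketch (not the project's method)
The hot deck and distribution-distance ideas listed above could look roughly as follows. This is a minimal sketch under simplifying assumptions: random hot deck within classes defined by the matching variables, survey weights ignored, and the Hellinger distance chosen here as one possible donor-recipient distance. The function names (random_hot_deck, hellinger), variable names (age, inc, food) and toy records are hypothetical and do not come from HBS or EU-SILC. Imputing this way implicitly treats the target variable as independent of the recipient-only variables once the matching variables are fixed, which is one of the assumptions whose sensitivity would need to be checked.

```python
import math
import random
from collections import Counter, defaultdict


def random_hot_deck(recipients, donors, match_vars, target, seed=0):
    """Impute `target` into each recipient record by drawing a random donor
    value from the same class (same values on all matching variables)."""
    rng = random.Random(seed)
    # Group donor values of the target variable by matching class.
    pools = defaultdict(list)
    for d in donors:
        pools[tuple(d[v] for v in match_vars)].append(d[target])
    fused = []
    for r in recipients:
        key = tuple(r[v] for v in match_vars)
        if key not in pools:
            raise ValueError(f"no donor available in class {key}")
        fused.append({**r, target: rng.choice(pools[key])})
    return fused


def hellinger(values_a, values_b):
    """Hellinger distance between two empirical categorical distributions
    (0 = identical shares, 1 = no category in common). Unweighted."""
    pa, pb = Counter(values_a), Counter(values_b)
    na, nb = len(values_a), len(values_b)
    categories = set(pa) | set(pb)
    s = sum((math.sqrt(pa[k] / na) - math.sqrt(pb[k] / nb)) ** 2
            for k in categories)
    return math.sqrt(s / 2)


# Hypothetical toy records: the donor file observes a food expenditure class,
# the recipient file does not.
donors = [
    {"age": "young", "inc": "low", "food": "low"},
    {"age": "young", "inc": "low", "food": "mid"},
    {"age": "old", "inc": "high", "food": "high"},
    {"age": "old", "inc": "high", "food": "mid"},
]
recipients = [
    {"age": "young", "inc": "low"},
    {"age": "old", "inc": "high"},
    {"age": "old", "inc": "high"},
]

fused = random_hot_deck(recipients, donors, ["age", "inc"], "food")
# Compare the marginal distribution of the imputed variable with the donor file.
distance = hellinger([d["food"] for d in donors], [f["food"] for f in fused])
print(fused)
print(round(distance, 3))
```

A small distance only indicates that the marginal distribution of the imputed variable was preserved; it says nothing about the joint distribution with variables observed only in the recipient file, which is why auxiliary information or overlap samples would still be needed for validation.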