Methodological challenges in integrating data collections in business statistics
Paul Smith
Office for National Statistics

Outline
• Data quality for different sources
    quality measures for survey and administrative inputs
    quality measures for outputs
• Combinations of sources
    familiar and more advanced situations
• Mode effects
• Models
• Discussion

Statistical data collections - quality
• Relevance
    generally questions conform to desired concepts
    may be tailoring for
      • practicality
      • consistency across collections even if concepts differ
• Accuracy
    affected by sampling
    impacts from non-response, measurement error
• Timeliness
    generally relatively timely

Administrative data - quality
• Relevance
    questions conform to administrative (not statistical) concepts
    few concessions to statistical needs
• Accuracy
    unaffected by sampling
    processes to discourage non-response
    treatment of measurement error differs by variable
• Timeliness
    generally slow

Differences between types of source
• Sampling accuracy is measurable for surveys, not relevant for administrative data sources
    confidence in quality reduced for admin data
    balance of accuracy measures different
• Building statistical requirements into administrative series requires negotiation and agreement
    VAT classification information in the UK
    INSEE has statistical and accounting information well integrated

Questionnaire design
• Questionnaire design principles mostly used in designing statistical collections
• Administrative data seen as “forms” not “questionnaires”
    less attention to question phrasing to obtain required answer
    more on statutory requirements

Output data quality
• Data quality from combined outputs can be challenging to measure
    function of the qualities of the input sources, and the methods used to combine them
    some well-known general approaches
    development of measures needed for particular cases (eg from models)

Combinations of sources - 1
• Frame and sample information
    Sampling frames typically derived from administrative sources
    Multiple uses of frame information
      • sample design
      • sample selection
      • validation and editing
      • estimation and variance estimation
    Quality easily derived – standard situation

Combinations of sources - 2
• Dual-frame surveys
    More than one administrative source
      • Pension funds survey in the UK
      • Units
      • Business register
      • Challenges of population inflation if matching not perfect
    Estimate probability that unit appears in sample from either source
      • use in appropriate weighting procedure
      • adjustment for P(in both surveys) depends on survey type

Combinations of sources - 3
• Multiple surveys
    different periodicity
      • summary information monthly, detail annually
      • for example capital expenditure – quarterly breakdown, annual summary
• Benchmarking
    where short-period surveys small (and variable) and annual larger (and less variable)
• Quality measures
    account for sampling error in both sources
    account for non-response and measurement errors in larger survey

Combinations of sources - 4
• Auxiliary information
    If administrative concept not close to statistical concept, data may still be useful
    Auxiliary information in estimation
      • not required to be correct, only correlated with outcome
      • the better the correlation, the better the accuracy
    Auxiliary information in validation
      • use tax data to improve validation follow-up activity
• Data confrontation
    Use multiple sources to identify discrepancies
    Balancing

Mode effects
• Mode effects manifest in several ways
    differences in contact rate
    differences in response rate given contact
    differences in question replies given response
• Test differences through a designed experiment (van den Brakel & Renssen 1998, 2005)
    evaluates whole-process differences (not individual steps)
    non-response adjustment if good predictors for response amongst auxiliary data (var increases)
    model-based adjustments for other changes

Temporal differences
• Administrative data often have longer reference period than statistical requirement
• Implies temporal disaggregation (model-based) – Dagum & Cholette 2006
• Quality implications
    estimated data as inputs
    sensitivity of model to interesting changes

Models for combining data
• Full flexibility in combining data available through modelling approach
• Models at boundary between statistical producer and user
• Ideally statistical results insensitive to model assumptions
    small area estimates
      • useful for social surveys
      • challenges for business surveys not yet resolved
    modelling for unit structures - BRES

Discussion
• Aim: more from existing sources
    often imperfect matches
    modelling only appropriate approach
      • subjective
      • robust to assumptions
      • sensitivity analysis
• Mixed mode collections
    usability and low cost
    data combination
    quality components harder to measure
• for more details see the paper, or contact paul.smith@ons.gov.uk
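
Illustration for "Combinations of sources - 2": a minimal sketch of estimating the probability that a unit appears in the sample from either source and using it as a weight. It assumes the two samples are drawn independently, so that P(in either) = p_a + p_b - p_a * p_b; this is one possible adjustment for P(in both), not necessarily the one used in any particular survey. The function names and example probabilities are hypothetical.

```python
def prob_in_either(p_a: float, p_b: float) -> float:
    """P(unit is sampled via source A or source B), ASSUMING independent samples."""
    for p in (p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("selection probabilities must lie in [0, 1]")
    # Subtract P(in both) = p_a * p_b so the overlap is not counted twice.
    return p_a + p_b - p_a * p_b


def dual_frame_weight(p_a: float, p_b: float) -> float:
    """Design weight = 1 / P(in either sample)."""
    p = prob_in_either(p_a, p_b)
    if p == 0.0:
        raise ValueError("unit has no chance of selection from either source")
    return 1.0 / p


# Hypothetical unit present on both sources: 10% chance via A, 20% via B.
p = prob_in_either(0.1, 0.2)
w = dual_frame_weight(0.1, 0.2)

# A unit present on only one source has probability 0 on the other,
# so its weight reduces to the usual single-frame weight 1 / p_a.
w_single = dual_frame_weight(0.25, 0.0)
```

If the two samples are coordinated rather than independent, the P(in both) term changes, which is why the adjustment depends on survey type.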
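
Illustration for "Combinations of sources - 3": a minimal sketch of benchmarking short-period estimates to a larger, less variable annual survey. It uses simple pro-rata scaling, chosen here only because it is the simplest option and not because the slides prescribe it; movement-preserving methods such as Denton-type benchmarking are usually preferred in practice. The function name and figures are hypothetical.

```python
def prorate_to_benchmark(sub_annual: list[float], annual_total: float) -> list[float]:
    """Scale sub-annual estimates so that they sum to the annual benchmark."""
    total = sum(sub_annual)
    if total == 0:
        raise ValueError("sub-annual estimates sum to zero; cannot pro-rate")
    factor = annual_total / total  # one common adjustment factor for the year
    return [x * factor for x in sub_annual]


# Hypothetical quarterly estimates (sum 400) and an annual benchmark of 440.
quarters = [90.0, 110.0, 100.0, 100.0]
benchmarked = prorate_to_benchmark(quarters, 440.0)
```

Pro-rata scaling keeps the within-year proportions of the short-period series but can introduce a step between years when the adjustment factor changes.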