Methodological challenges in
integrating data collections in
business statistics
Paul Smith
Office for National Statistics
Outline
• Data quality for different sources
quality measures for survey and administrative
inputs
quality measures for outputs
• Combinations of sources
familiar and more advanced situations
• Mode effects
• Models
• Discussion
Statistical data collections - quality
• Relevance
generally questions conform to desired concepts
may be tailored for
• practicality
• consistency across collections even if concepts differ
• Accuracy
affected by sampling
impacts from non-response, measurement error
• Timeliness
generally relatively timely
Administrative data - quality
• Relevance
questions conform to administrative (not statistical)
concepts
few concessions to statistical needs
• Accuracy
unaffected by sampling
processes to discourage non-response
treatment of measurement error differs by variable
• Timeliness
generally slow
Differences between types of source
• Sampling accuracy is measurable for
surveys, not relevant for administrative data
sources
confidence in quality reduced for admin data
balance of accuracy measures different
• Building statistical requirements into
administrative series
requires negotiation and agreement
VAT classification information in the UK
INSEE has statistical and accounting information
well integrated
Questionnaire design
• Questionnaire design principles mostly used
in designing statistical collections
• Administrative data seen as “forms” not
“questionnaires”
less attention to question phrasing to obtain required
answer
more on statutory requirements
Output data quality
• Data quality from combined outputs can be
challenging to measure
function of the qualities of the input sources, and the
methods used to combine them
some well-known general approaches
development of measures needed for particular
cases (e.g. from models)
Combinations of sources - 1
• Frame and sample information
Sampling frames typically derived from
administrative sources
Multiple uses of frame information
• sample design
• sample selection
• validation and editing
• estimation and variance estimation
Quality easily derived – standard situation
Combinations of sources - 2
• Dual-frame surveys
More than one administrative source
Pension funds survey in the UK
Units
Business register
Challenges of population inflation if matching not perfect
Estimate probability that unit appears in sample from
either source
• use in appropriate weighting procedure
• adjustment for P(in both surveys) depends on survey
type
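The weighting step above can be sketched as follows. This is a minimal illustration, assuming the two samples are drawn independently (an assumption not stated in the slides), so that the probability of a unit appearing in at least one sample is p_A + p_B − p_A·p_B. The function names and the example probabilities are hypothetical.

```python
def combined_inclusion_prob(p_a: float, p_b: float) -> float:
    """P(unit selected in sample A or sample B), assuming independent draws."""
    return p_a + p_b - p_a * p_b


def dual_frame_weight(p_a: float, p_b: float) -> float:
    """Design weight as the reciprocal of the combined inclusion probability."""
    p = combined_inclusion_prob(p_a, p_b)
    if p <= 0:
        raise ValueError("unit has zero chance of selection from either frame")
    return 1.0 / p


# Hypothetical units: (P(selected via frame A), P(selected via frame B)).
# A unit present on only one frame has probability 0 for the other.
units = {"only_A": (0.2, 0.0), "only_B": (0.0, 0.5), "both": (0.2, 0.5)}
weights = {name: dual_frame_weight(pa, pb) for name, (pa, pb) in units.items()}
```

If matching between the frames is imperfect, a unit that is really on both frames would be treated as two single-frame units and given too large a weight, which is the population inflation problem noted above.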
Combinations of sources - 3
• Multiple surveys
different periodicity
• summary information monthly, detail annually
• for example capital expenditure – quarterly breakdown,
annual summary
• Benchmarking
where short-period surveys are small (and variable) and
annual surveys larger (and less variable)
• Quality measures
account for sampling error in both sources
account for non-response and measurement errors
in larger survey
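As a hedged illustration of benchmarking, the sketch below uses simple pro-rata scaling, which is only one of several possible methods and is not necessarily the one used in practice: the short-period estimates are scaled so that they sum to the more reliable annual figure. The function name and the figures are hypothetical.

```python
def prorata_benchmark(sub_annual, annual_total):
    """Scale sub-annual values so that they sum to the annual benchmark."""
    total = sum(sub_annual)
    if total == 0:
        raise ValueError("cannot scale a series that sums to zero")
    factor = annual_total / total
    return [value * factor for value in sub_annual]


# Hypothetical quarterly estimates (sum 100) and annual benchmark (110).
quarterly = [24.0, 26.0, 25.0, 25.0]
annual = 110.0
benchmarked = prorata_benchmark(quarterly, annual)
```

Pro-rata scaling preserves the within-year proportions but can introduce a step between years, which is one reason smoother methods are often preferred.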
Combinations of sources - 4
• Auxiliary information
If administrative concept not close to statistical
concept, data may still be useful
Auxiliary information in estimation
• not required to be correct, only correlated with outcome
• the better the correlation, the better the accuracy
Auxiliary information in validation
• use tax data to improve validation follow-up activity
• Data confrontation
Use multiple sources to identify discrepancies
Balancing
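The point that auxiliary information only needs to be correlated with the outcome can be illustrated with a ratio estimator. This is a minimal sketch assuming simple random sampling and a known population total for the administrative variable; the function name and all figures are hypothetical.

```python
def ratio_estimate_total(y_sample, x_sample, x_population_total):
    """Estimate the population total of y using auxiliary variable x.

    y_sample: survey values for the sampled units
    x_sample: administrative values for the same units
    x_population_total: administrative total over the whole population
    """
    sum_x = sum(x_sample)
    if sum_x == 0:
        raise ValueError("auxiliary variable sums to zero in the sample")
    return x_population_total * sum(y_sample) / sum_x


# Hypothetical data: survey turnover (y) and tax-based turnover (x).
y = [12.0, 30.0, 18.0]
x = [10.0, 25.0, 15.0]
estimated_total = ratio_estimate_total(y, x, x_population_total=500.0)
```

Here the administrative values are systematically lower than the survey values, yet the estimator still works because the two are proportional; a weaker relationship would give a less accurate estimate.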
Mode effects
• Mode effects manifest in several ways
differences in contact rate
differences in response rate given contact
differences in question replies given response
• Test differences through a designed
experiment (van den Brakel & Renssen 1998,
2005)
evaluates whole-process differences (not individual
steps)
non-response adjustment if good predictors for
response amongst auxiliary data (variance increases)
model-based adjustments for other changes
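A whole-process comparison of two modes can be sketched as a difference in means between two randomly assigned groups. This is a much simpler illustration than the design-based analysis in the cited papers: it assumes simple random assignment and ignores the sample design. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def mode_difference(mode_a, mode_b):
    """Return (difference in means, standard error of the difference)."""
    diff = mean(mode_a) - mean(mode_b)
    # Sample variances divided by group sizes, groups treated as independent.
    se = sqrt(variance(mode_a) / len(mode_a) + variance(mode_b) / len(mode_b))
    return diff, se


# Hypothetical responses collected under two modes.
paper = [10.0, 12.0, 11.0, 13.0]
web = [9.0, 11.0, 10.0, 12.0]
diff, se = mode_difference(paper, web)
```

The difference captures the combined effect of contact, response and measurement differences, so it cannot on its own attribute the effect to any single step.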
Temporal differences
• Administrative data often have longer
reference period than statistical requirement
• Implies temporal disaggregation (model-based) – Dagum & Cholette 2006
• Quality implications
estimated data as inputs
sensitivity of model to interesting changes
Models for combining data
• Full flexibility in combining data available
through modelling approach
• Models at boundary between statistical
producer and user
• Ideally statistical results insensitive to model
assumptions
small area estimates
• useful for social surveys
• challenges for business surveys not yet resolved
modelling for unit structures - BRES
Discussion
• Aim: more from existing sources
often imperfect matches
modelling is the only appropriate approach
• subjective
• robust to assumptions
• sensitivity analysis
• Mixed mode collections
usability and low cost
data combination
quality components harder to measure
• for more details see the paper, or contact
paul.smith@ons.gov.uk