The Use of Administrative Sources
for Statistical Purposes
Matching and
Integrating Data from
Different Sources
What is Matching?
• Linking data from different sources
• Exact Matching - linking records from
two or more sources, often using
common identifiers
• Probabilistic Matching - determining
the probability that records from different
sources should match, using a
combination of variables
Why Match?
• Combining data sets can give more
information than is available from
individual data sets
• Reduce response burden
• Build efficient sampling frames
• Impute missing data
• To allow data integration
Models for Data Integration
• Statistical registers
• Statistics from mixed source models
– Split population model
– Split data approach
– Pre-filled questionnaires
– Using administrative data for non-responders
– Using administrative data for estimation
• Register-based statistical systems
Statistical Registers
[Diagram: Statistical Register linked to Administrative Sources, Survey data, Satellite registers, Geographic information systems and Other Statistical Registers]
Mixed Source Models
• Traditionally one statistical output was
based on one statistical survey
• Very little integration or coherence
• Now there is a move towards more
integrated statistical systems
• Outputs are based on several sources
Split Population Model
• One source of data for each unit
• Different sources for different parts
of the population
[Diagram: Population of Statistical Units; Administrative Data; Statistical Survey; Estimation; Statistics]
Split Data Approach
• Several sources of data for each unit
[Diagram: Unit 1, Unit 2, Unit 3 ... Unit n; Administrative Data; Statistical Survey; Estimation; Statistics]
Pre-filled Questionnaires
• Survey questionnaires are pre-filled
with data from other sources where
possible
• Respondents check that the
information is correct, rather than
completing a blank questionnaire
• This reduces response burden
...... but may introduce a bias!
Example
Manufacture of wooden furniture
Using Administrative Data
for Non-responders
• Administrative data are used directly
to supply variables for units that do
not respond to a statistical survey
• Often used for less important units,
so that response-chasing resources
can be focused on key units
Using Administrative
Data for Estimation
• Administrative data are used as
auxiliary variables to improve the
accuracy of statistical estimation
• Often used to estimate for small subpopulations or small geographic
areas
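One common way to use an administrative variable as an auxiliary is ratio estimation. The sketch below is illustrative only: the slides do not say which estimator is used, and the function name and the figures are assumptions.

```python
def ratio_estimate(sample, admin_total):
    """Estimate a population total of y using an auxiliary admin variable x.

    sample: list of (y_survey, x_admin) pairs for the surveyed units.
    admin_total: total of x over the whole population, from the admin source.
    """
    sum_y = sum(y for y, _ in sample)
    sum_x = sum(x for _, x in sample)
    if sum_x == 0:
        raise ValueError("auxiliary variable sums to zero in the sample")
    # Scale the admin total by the survey-to-admin ratio seen in the sample.
    return admin_total * sum_y / sum_x


# Hypothetical figures: survey value (y) and admin-recorded value (x) per unit.
sample = [(120.0, 100.0), (230.0, 200.0), (55.0, 50.0)]
estimate = ratio_estimate(sample, admin_total=7000.0)
```

The closer the administrative variable tracks the survey variable, the more the auxiliary information helps.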
Register-based Statistical Systems
[Diagram: Administrative Sources; Statistical Registers; Population Register; Jobs and Other Activities; Real Estate Register; Business Register; Statistical Surveys; Statistical Outputs]
Matching
Terminology
Matching Keys
• Data fields used for matching e.g.
• Reference Number
• Name
• Address
• Postcode/Zip Code/Area Code
• Birth/Death Date
• Classification (e.g. ISIC, ISCO)
• Other variables (age, occupation, etc.)
Distinguishing Power 1
• This relates to the uniqueness of the
matching key
• Some keys or values have higher
distinguishing powers than others
• High - reference number, full name,
full address
• Low - sex, age, city, nationality
Distinguishing Power 2
• Can depend on level of detail
– Born 1960, Paris
– Born 23 June 1960, rue de l’Eglise,
Montmartre, Paris
• Choose variables, or combinations
of variables with the highest
distinguishing power
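A rough way to compare candidate keys is to measure how often a key value is unique in a file. The sketch below uses the share of records with a unique key value as the measure; that choice, the function name and the sample records are assumptions, not a definition taken from the slides.

```python
from collections import Counter


def distinguishing_power(records, fields):
    """Share of records whose combined value on `fields` is unique in the file."""
    keys = [tuple(rec[f] for f in fields) for rec in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical records: the same people described at two levels of detail.
people = [
    {"birth_year": 1960, "birth_date": "1960-06-23", "city": "Paris"},
    {"birth_year": 1960, "birth_date": "1960-01-02", "city": "Paris"},
    {"birth_year": 1960, "birth_date": "1960-06-23", "city": "Lyon"},
    {"birth_year": 1971, "birth_date": "1971-03-15", "city": "Paris"},
]
low = distinguishing_power(people, ["birth_year", "city"])
high = distinguishing_power(people, ["birth_date", "city"])
```

Here the year-level key separates only half of the records, while the full date separates all of them.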
Match
• A pair that represents the same
entity in reality
A = A
Non-match
• A pair that represents two
different entities in reality
A ≠ B
Possible Match
• A pair for which there is not enough
information to determine whether it
is a match or a non-match
A ? a
False Match
• A pair wrongly designated as a
match in the matching process
(false positive)
A = B
False Non-match
• A pair which is a match in reality, but
is designated as a non-match in the
matching process (false negative)
A ≠ A
Matching
Techniques
Clerical Matching
• Requires clerical resources
- Expensive
- Inconsistent
- Slow
- Intelligent
Automatic Matching
• Minimises human intervention
- Cheap
- Consistent
- Quick
- Limited intelligence
The Solution
• Use an automatic matching tool to find
obvious matches and non-matches
• Refer possible matches to specialist
staff
• Maximise automatic matching rates
and minimise clerical intervention
How Automatic
Matching Works
Standardisation
• Generally used for text variables
• Abbreviations and common terms are
replaced with standard text
• Common variations of names are
standardised
• Postal codes, dates of birth etc. are
given a common format
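A minimal sketch of standardisation, assuming small lookup tables and UK-style postcodes. The tables and function names are hypothetical; a production system would use far larger, language-specific tables.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"ltd": "limited", "co": "company", "st": "street",
                 "ave": "avenue", "rd": "road"}
NAME_VARIANTS = {"steve": "stephen", "steven": "stephen", "mike": "michael"}


def standardise_text(text, table):
    """Lower-case, drop punctuation, and replace each word found in `table`."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(table.get(word, word) for word in words)


def standardise_postcode(postcode):
    """Upper-case a UK postcode and put one space before its last three characters."""
    compact = re.sub(r"\s+", "", postcode.upper())
    return compact[:-3] + " " + compact[-3:]
```

For example, `standardise_text("Acacia Ave.", ABBREVIATIONS)` gives `"acacia avenue"`, so two spellings of the same address compare equal afterwards.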
Blocking
• If the file to be matched against is
very large, it may be necessary to
break it down into smaller blocks to
save processing time
– e.g. if the record to be matched is in a
certain town, only match against other
records from that town, rather than all
records for the whole country
Blocking
• Blocking must be used carefully, or
good matches will be missed
• Experiment with different blocking
criteria on a small test data set
• Possible to have two or more
passes with different blocking
criteria to maximise matches
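The idea of several blocking passes can be sketched as follows. Record layouts, field names and the two blocking functions are assumptions made for the example.

```python
from collections import defaultdict


def candidate_pairs(incoming, reference, passes):
    """Return (incoming_index, reference_index) pairs that share a block
    under at least one of the blocking functions in `passes`."""
    pairs = set()
    for block_key in passes:
        # Group the reference file by this pass's blocking key.
        blocks = defaultdict(list)
        for j, ref in enumerate(reference):
            blocks[block_key(ref)].append(j)
        # Compare each incoming record only with records in its own block.
        for i, rec in enumerate(incoming):
            for j in blocks.get(block_key(rec), []):
                pairs.add((i, j))
    return sorted(pairs)


def by_town(rec):
    return rec["town"].lower()


def by_postcode(rec):
    return rec["postcode"].replace(" ", "").upper()


# Hypothetical data: the third reference record has a misspelt town.
reference = [
    {"name": "Vale Stores", "town": "Newport", "postcode": "NP10 8XG"},
    {"name": "Villars Cafe", "town": "Cardiff", "postcode": "CF10 1AA"},
    {"name": "Vale Stores Ltd", "town": "Newprt", "postcode": "NP10 8XG"},
]
incoming = [{"name": "Vale Stores", "town": "Newport", "postcode": "NP10 8XG"}]

one_pass = candidate_pairs(incoming, reference, [by_town])
two_pass = candidate_pairs(incoming, reference, [by_town, by_postcode])
```

Blocking on town alone misses the misspelt record; the second pass on postcode brings it back as a candidate without comparing against the whole file.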
Parsing
• Names and words are broken down
into matching keys
e.g.
Steven Vale → stafan val
Stephen Vael → stafan val
• Improves success rates by allowing
matching where variables are not
identical
Scoring
• Matched pairs are given a score
based on how closely the matching
variables agree
• Scores determine matches, possible
matches and non-matches
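Score
[Diagram: score scale from 0 to 100 with thresholds y and x. Scores above x are matches, scores between y and x are possible matches, scores below y are non-matches]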
How to Determine
X and Y
• Mathematical methods
e.g. Fellegi / Sunter method
• Trial and Error
• Data contents and quality may
change over time so periodic
reviews are necessary
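In the Fellegi / Sunter approach each matching variable contributes a log-likelihood-ratio weight, and the summed weight is compared with the two thresholds. The sketch below is illustrative only: the m and u probabilities and the threshold values are invented, and in practice the thresholds are derived from the error rates that can be tolerated.

```python
import math


def field_weight(agrees, m, u):
    """Weight for one field.

    m = P(field agrees | pair is a true match)
    u = P(field agrees | pair is a true non-match)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


# Hypothetical (m, u) probabilities per matching variable.
M_U = {"name": (0.95, 0.01), "postcode": (0.90, 0.05), "sex": (0.98, 0.50)}


def pair_weight(agreement):
    """Sum the field weights for a pair, given {field: True/False} agreement."""
    return sum(field_weight(agreement[f], *M_U[f]) for f in M_U)


def classify(weight, x, y):
    """x is the upper threshold and y the lower one, as on the score scale."""
    if weight >= x:
        return "match"
    if weight <= y:
        return "non-match"
    return "possible match"
```

A variable with low distinguishing power, such as sex, has m and u close together and therefore adds little weight when it agrees.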
Enhancements
• Re-matching files at a later date
reduces false non-matches (if at least
one file is updated)
• Link to data cleaning software, e.g.
address standardisation
Matching Software
• Commercial products e.g.
SSAName3, Trillium, Automatch
• In-house products e.g. ACTR
(Statistics Canada)
• Open-source products e.g. FEBRL
• No “off the shelf” products - all
require tuning to specific needs
Internet Applications
• Google (and other search engines)
– www.google.com
• Cascot – an automatic coding tool
based on text matching
– http://www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/choose_classificatio/
• Address finders e.g. Postes Canada
– http://www.postescanada.ca/tools/pcl/bin/advanced-f.asp
Software Applications
• Trigram method applied in SAS code
(freeware) for matching in the Eurostat
business demography project
• Similar approach in UNECE “Data
Locator” search tool
• Works by comparing groups of 3
letters, and counting matching groups
Trigram Method
• Match “Steven Vale”
– Ste/tev/eve/ven/en /n V/ Va/Val/ale
• To “Stephen Vale”
– Ste/tep/eph/phe/hen/en /n V/ Va/Val/ale
– 6 matching trigrams
• And “Stephen Vael”
– Ste/tep/eph/phe/hen/en /n V/ Va/Vae/ael
– 4 matching trigrams
• Parsing would improve these scores
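The counts above can be reproduced with a few lines of Python. This is a sketch rather than the SAS code mentioned earlier; it is case-sensitive and counts each distinct trigram once, which is an assumption about how repeated trigrams are handled.

```python
def trigrams(text):
    """All overlapping groups of three characters, spaces included."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def matching_trigrams(a, b):
    """Number of distinct trigrams that occur in both strings."""
    return len(set(trigrams(a)) & set(trigrams(b)))


vale = matching_trigrams("Steven Vale", "Stephen Vale")
vael = matching_trigrams("Steven Vale", "Stephen Vael")
```

Because the raw count grows with string length, a real application would usually normalise it, for example by the number of trigrams in the shorter string.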
Matching in
Practice
Matching Records Without
a Common Identifier
The UK Experience
by
Steven Vale (Eurostat / ONS)
and Mike Villars (ONS)
The Challenge
• The UK statistical business register
relies on several administrative
sources
• It needs to match records from these
different sources to avoid duplication
• There is no system of common
business identification numbers in the UK
The Solution
• Records are matched using business
name, address and post code
• The matching software used is Identity
Systems / SSA-NAME3
• Matching is mainly automatic via batch
processing, but a user interface also
allows the possibility of clerical
matching
Batch Processing 1
• Name is compressed to form a namekey,
the last word of the name is the major key
• Major keys are checked against those of
existing records at decreasing levels of
accuracy until possible matches are found
• The name, address and post codes of
possible matches are compared, and a
score out of 100 is calculated
Batch Processing 2
• If the score is >79 it is considered
to be a definite match
• If the score is between 60 and 79 it
is considered a possible match,
and is reported for clerical checking
• If the score is <60 it is considered a
non-match
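The three score bands translate directly into a small routing function. The function name and return labels are illustrative.

```python
def classify_score(score):
    """Route a 0-100 match score using the thresholds given above."""
    if score > 79:
        return "definite match"
    if score >= 60:
        return "possible match"  # reported for clerical checking
    return "non-match"


results = [classify_score(s) for s in (92, 79, 60, 59)]
```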
Clerical Processing
• Possible matches are checked and linked
where appropriate using an on-line system
• Non-matches with employment above 9 are
checked - if no link is found they are sent a
Business Register Survey questionnaire
• Samples of definite matches and smaller
non-matches are checked periodically
Problems Encountered 1
• “Trading as” or “T/A” in the name
e.g. Mike Villars T/A Mike’s Coffee Bar, Bar
would be the major key, but would give too
many matches as there are thousands of
bars in the UK.
• Solution - split the name so that the last
word prior to “T/A” e.g. Villars is the major
key, improving the quality of matches.
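The "T/A" rule can be sketched as below. This is not the SSA-NAME3 implementation, only an illustration of choosing the major key; the function name and the regular expression are assumptions.

```python
import re

# Matches "T/A" or "trading as" surrounded by whitespace, in any case.
TRADING_AS = re.compile(r"\s+(?:T/A|TRADING AS)\s+", re.IGNORECASE)


def major_key(name):
    """Last word of the name, or the last word before 'T/A' if present."""
    legal_part = TRADING_AS.split(name.strip(), maxsplit=1)[0]
    words = re.findall(r"[A-Za-z0-9']+", legal_part)
    return words[-1].upper() if words else ""


with_rule = major_key("Mike Villars T/A Mike's Coffee Bar")
without_ta = major_key("Mike's Coffee Bar")
```

With the split in place the first name yields VILLARS rather than BAR, so it is compared against far fewer existing records.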
Problems Encountered 2
• The number of small non-matched units
grows over time leading to increasing
duplication
• Checking these units is labour intensive
• Solutions
– Fine tune matching parameters
– Re-run batch processes
– Use extra information e.g. legal form /
company number where available
Future Developments
• Clean and correct addresses prior to
matching using “QuickAddress” and the
Post Office Address File
• Links to geographical referencing
• Business Index - plans to link registers of
businesses across UK government
departments
• Unique identifiers?
One Number Census
Matching
by
Ben Humberstone (ONS)
One Number Census
• Aim: To estimate and adjust for
underenumeration in the 2001 Census
• Census Coverage Survey (CCS) - 1%
sample stratified by hard-to-count area
– 320,000 households
– 500,000 people
• 101 Estimation Areas in England and
Wales
ONC Process
[Diagram: Census; CCS; Matching; Dual System Estimation; Quality Assurance; Imputation; Adjusted Census DB]
ONC Matching Process
[Diagram: CCS; Census; Exact Matching; Probability Matching; Clerical Review; Clerical Matching; Quality Assurance; Matched Records. Key: green = CCS, blue = Census, red = matched pair, italics = automated]
Data Preparation
• Names
– Soundex used to bring together
different spellings of the same name
• Anderson, Andersen = A536
• Smith, Smyth = S530
• Addresses
– Converted to a numeric/alpha string
• 12a Acacia Avenue = 12AA
• Top Flat, 12 Acacia Ave. = 12AA
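The name codes above are what standard American Soundex produces. A compact sketch follows; whether the ONC system used exactly this variant is not stated, so treat the details as an assumption.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Four-character Soundex code: first letter plus up to three digits."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        # Skip vowels and repeats of the previous code.
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters that share a code.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


codes = {n: soundex(n) for n in ("Anderson", "Andersen", "Smith", "Smyth")}
```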
Exact Matching
• Data “blocked” at postcode level
• Households matched on key variables
– surname, address name/number,
accommodation type, number of people
• Individuals from within matched
households matched
– forename, surname, day of birth, month
of birth, marital status, relationship to
head of household
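Exact matching within postcode blocks amounts to a keyed lookup. In the sketch below the field names are hypothetical, and accepting a link only when exactly one Census household agrees is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical field names for the household key variables.
HOUSEHOLD_KEYS = ("surname", "address_number", "accommodation_type", "people")


def household_key(hh):
    """Postcode (the block) followed by the key variables."""
    return (hh["postcode"],) + tuple(hh[k] for k in HOUSEHOLD_KEYS)


def exact_match_households(ccs, census):
    """Return (ccs_index, census_index) pairs that agree on every key variable."""
    index = defaultdict(list)
    for j, hh in enumerate(census):
        index[household_key(hh)].append(j)
    matches = []
    for i, hh in enumerate(ccs):
        candidates = index.get(household_key(hh), [])
        if len(candidates) == 1:  # accept only unambiguous agreement
            matches.append((i, candidates[0]))
    return matches


ccs = [
    {"postcode": "NP10 8XG", "surname": "Smith", "address_number": "12",
     "accommodation_type": "Terrace", "people": 3},
    {"postcode": "NP10 8XG", "surname": "Jones", "address_number": "14",
     "accommodation_type": "Terrace", "people": 2},
]
census = [
    {"postcode": "NP10 8XG", "surname": "Jones", "address_number": "14",
     "accommodation_type": "Terrace", "people": 2},
    {"postcode": "NP10 8XG", "surname": "Smith", "address_number": "12",
     "accommodation_type": "Terrace", "people": 4},
]
matches = exact_match_households(ccs, census)
```

The Smith household differs on number of people, so it is left unmatched at this stage and would pass to the next step.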
Probability Matching
• Block by postcode
• Compare CCS with all Census
households in postcode + neighbouring
postcodes using key variables
• Create matrix according to match weight
CCS             Census           Cum. Weight
1 Acacia Ave    1 Acacia Ave     1450
1 Acacia Ave    1a Acacia Ave    740
1 Acacia Ave    11 Acacia Ave    220
1 Acacia Ave    12 Acacia Ave    112
• Repeat for people within matched
households
Probability Matching
• Matching weights
                           CCS
Census           Detached   Semi-detached   Terrace
Detached           +10          -1            -10
Semi-detached       +1          +7             +5
Terrace             -5          -3             +6
• Apply threshold to cumulative weights
• 2 thresholds
– High probability matches
– Low probability matches
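A sketch of how a weight lookup and the two thresholds fit together. The accommodation weights are read from the matrix above with rows taken as Census and columns as CCS, which is an interpretation of the slide; the threshold values are invented for illustration.

```python
# Accommodation-type weights, indexed [census_value][ccs_value].
ACCOMMODATION_WEIGHTS = {
    "Detached": {"Detached": 10, "Semi-detached": -1, "Terrace": -10},
    "Semi-detached": {"Detached": 1, "Semi-detached": 7, "Terrace": 5},
    "Terrace": {"Detached": -5, "Semi-detached": -3, "Terrace": 6},
}


def accommodation_weight(census_value, ccs_value):
    return ACCOMMODATION_WEIGHTS[census_value][ccs_value]


def classify(cum_weight, high, low):
    """Apply the two thresholds to a cumulative weight."""
    if cum_weight >= high:
        return "high probability match"
    if cum_weight >= low:
        return "low probability match"  # goes to clerical review
    return "unmatched"


# Cumulative weights from the Acacia Ave example; thresholds are hypothetical.
outcomes = [classify(w, high=1000, low=500) for w in (1450, 740, 220, 112)]
```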
Automatic Match Review
• Clerical role
• Matchers presented with all low
probability matches
– Household matches
– Matched individuals within matched
households
• Access to form images to check
scanning
• Basic yes/no operation
Clerical Matching
• Clerical matching of all unmatched
records
• Matchers - perform basic searches,
match or defer
• Experts - carry out detailed searches on
deferred records and review matches
• Quality assurance staff - review experts'
work including all unmatchable records
using estimation area wide searches
Quality Assurance
• Experts and Quality Assurance staff
• Double Matching
– Estimation area matched twice,
independently
– Outputs compared, discrepancies checked
• Matching protocol
– Based on best practice
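Comparing the outputs of two independent matching runs is a set comparison. The sketch below assumes each run is a collection of (CCS id, Census id) pairs; the identifiers are hypothetical.

```python
def compare_double_matching(run_a, run_b):
    """Split the links from two independent runs into agreed and discrepant sets."""
    a, b = set(run_a), set(run_b)
    return {"agreed": a & b, "only_in_a": a - b, "only_in_b": b - a}


# Hypothetical outputs for one estimation area.
run_a = [("ccs1", "cen7"), ("ccs2", "cen9"), ("ccs3", "cen4")]
run_b = [("ccs1", "cen7"), ("ccs2", "cen8"), ("ccs3", "cen4")]
report = compare_double_matching(run_a, run_b)
```

Pairs that appear in only one run are the discrepancies to be checked.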
Resources
• 8 - 10 Matchers
• 4 - 5 Expert Matchers
• 2 - 3 Quality Assurance staff
• 3 Research Officers/Supervisors
• 1 Senior Research Officer
• Computer Assisted Matching System
Quality Assurance
England & Wales            Household   Person
Automatically Matched        58.8%      51.1%
Clerically Resolved          13.7%      11.4%
Clerically Matched           22.3%      30.7%
Unmatched CCS                 5.0%       6.4%
Excluded CCS                  0.2%       0.3%
Unmatched Census             12.8%      11.7%
Excluded Census               0.0%       0.0%
• False negative rate: < 0.1%
• 1 Estimation area matched per day
Group Discussion
Practical experiences
of data matching