Census Population Estimates - Office for National Statistics

GSS Methodology Symposium, 6 July 2011
Abstract
Title: Linkage of administrative sources to validate Census population estimates
This paper describes how record-level administrative data are being used to help validate Census population
estimates. The innovative data matching approach described here is just one component of a comprehensive
data quality assurance strategy for the 2011 Census. The focus here is the validation of Census estimates of
undercount. A matrix of linked information on addresses and the people living in them, drawn from
administrative data, can shed new light on households and individuals that Census processes may have
missed. The administrative sources will be described, together with the methods being used to clean and link
them. This paper also describes how new Census information on second residences and visitors, together with
addresses one year ago, will support the interpretation of linked records and of unmatched residuals. These
results and methods draw on a series of data matching pilots designed to strengthen our understanding of
available administrative sources and their quality1.
Keywords: Census, Administrative Sources, record matching
Louisa Blackwell
Louisa.blackwell@ons.gsi.gov.uk
0132944 4539
Nicky Rogers
Nicola.rogers@ons.gsi.gov.uk
0132944 4866
1 Quality of administrative data sources was assessed in terms of their statistical quality and their use to validate or
improve the Census population estimates.
1. Introduction and background
1.1. Introduction
This paper is concerned with the use of administrative data sources to validate 2011 Census population
estimates. Administrative data will be used as comparators for Census counts and estimates, for detailed
investigations of data discrepancies at low-level geographies and for record-level matching. Record matching
will only be invoked where there are discrepancies between Census population estimates and comparator
datasets that cannot be explained at the aggregate level.
The 2011 Census data quality assurance (QA) strategy and QA Methodology2 sets out a wide range of checks
and data comparisons that will ensure that Census estimates at local authority, regional and national level are
accurate. These are summarised in the ‘Background’ section below. We then describe the administrative data
to be used in the QA process. A critical consideration in the statistical use of administrative data is that this
information was primarily collected for other purposes. Data preparation for aggregate-level comparison and
for data matching involves data cleaning as well as standardisation and synchronisation, which we describe in
the ‘Methods’ section below. Synchronising Census and administrative sources can be challenging, given the
presence of source-specific time lags between an individual’s interaction with the administrative system, and
the availability of that information for analysis at ONS.
The 2011 Census Data Matching Team has conducted a series of case studies to investigate how the statistical
quality issues associated with using administrative sources to validate Census population estimates vary in
different types of local authority (LA). We present some examples from these case studies, as they highlight
both the utility of administrative sources for probing data discrepancies and their limitations, particularly
in areas that have high population churn.
In examining administrative data for the case study LAs, we asked:
1. To what extent do the counts of addresses vary between the Address Register, Council Tax, the
Patient Register, Electoral Register and Valuation Office Agency data?
2. Do these patterns vary in different types of local authority?
3. How does the coverage of administrative sources in Inner London compare with urban and rural areas
with less population flux?
4. What can the aggregate-level comparisons tell us about the data matching strategy to be followed?
5. What data quality issues can be tackled ahead of data quality assurance and data matching?
6. Are some sources better at counting the number and characteristics of particular age-sex groups than
others?
The results of this analysis have shaped the approach taken to record-level matching. Conversely, record
matching will shed further light on these results. For example, the only way to check that consistent address
or person counts in different sources refer to the same entities is to match the individual records and evaluate
the unmatched residuals.
Resource constraints meant that this analysis was restricted to eight local authorities. Data matching will only
be used in local authorities that pose the greatest enumeration challenge. To capture the salient characteristics
of administrative sources in this type of area, and to gain early substantive insights into LAs likely to require
supplementary analysis, four of the eight selected LAs are in Inner London. Two LAs that pose less extreme,
but still distinctive, enumeration challenges were also included. By contrast, two of the least
challenging LAs were also analysed.
Address-based data comparisons focus on data discrepancies in Census Coverage Survey (CCS) postcode
clusters and in all postcodes within each LA. The focus on CCS postcode clusters is because when a
population estimate for a particular LA is challenged and the issue(s) cannot be resolved with reference to
aggregate-level data, the data matching team will consider whether administrative sources can shed new light
on the areas within the LA that were used for dual system estimation. In CCS postcodes we also have most
information on Census-day populations.
2 Office for National Statistics 2011 Census Data Quality Assurance Strategy June 2009
Person-level data were disaggregated by age and sex so that the representativeness of the administrative
sources for different population sub-groups and for males and females within these could be explored. There
are gendered differences in interactions with the NHS Register, for example young women are more likely to
register with and consult GPs than young men. This could create an undercount in the Patient Register of
young men. Customer Information System (CIS) address data for women bringing up children could easily be
more up to date than those for men if benefits or tax credits are being claimed. Such patterns may be visible, particularly when compared
against ONS’ mid-year population estimates.
1.2. Background
Administrative data will be used at varying levels of aggregation to quality assure Census population
estimates. An overview of the Census data quality assurance strategy is provided in Wroth-Smith et al
(2011)3. This is summarised in Figure 1. The quality assurance process is organised into three stages: core
checks on the data, supplementary checks and, for data anomalies that cannot be understood using aggregate-level
data, record matching. Core and supplementary checks will be carried out at local authority and lower
levels of geography. There will also be regional and national-level checks, using data comparators and
demographic indicators, to ensure that the emerging and final regional and national estimates are accurate.
Core and supplementary analysis may include a focus on population sub-groups that pose particular
enumeration challenges, such as migrants or students, or on groups that the checking process has highlighted
as under- or over-represented.
3 Wroth-Smith, J., Abbott, O., Compton, G. and Benton, P. (2011) ‘Quality Assuring the 2011 Census Population
Estimates’, Population Trends 143: 1:9.
Figure 1 Quality Assurance Overview
It is efficient to compare Census and administrative data at the aggregate level because data aggregates can be
anonymised and comparisons can be quickly automated to deliver easy-to-interpret results. However, there
could be aggregate-level discrepancies between two or more sources that cannot be explained. The
differences could reflect a localised breakdown in Census field or estimation processes. To understand and
reconcile such differences, new insights may be gained by matching records for addresses and individuals
within a particular area.
Matching administrative sources to Census data will:
• Identify addresses missed or misclassified by Census
• Inform the alternative household count, used for coverage estimation
• Assess dual system estimates through comparisons of address and person counts in CCS postcode
clusters
• Validate address and person counts in small geographic pockets with high Census non-response
• Identify people that Census missed within counted households
• Check counts and estimates for key population sub-groups (such as babies, school children, higher
education students and immigrants)
Data matching will only be undertaken within the secure and legally compliant business systems and
processes that have been built for the 2011 Census.
1.3 Guiding principles for record matching
The Census data matching strategy has been developed to focus on underenumeration, that is households and
individuals that may have been missed by Census processes. Other issues such as Census overcount are dealt
with in other strands of Census estimation and the quality assurance strategy. In addition, newly available
2011 Census information on date of arrival, intended length of stay, second residences, visitors (and their
usual addresses), together with respondents’ usual address one year ago will be used to help reconcile
unmatched residuals from administrative data.
Interpretation of the results from data matching will be an iterative and empirically-driven process. The
Address Register and its validation, Census enumeration, the Census Coverage Survey and Census data
processing provide a uniquely rich snapshot of people and places in March 2011 against which administrative
data can be compared. Through matching, knowledge will be gained incrementally. For example, address
matching ahead of Census will structure and simplify individual-level matching of Census response data. The
characteristics of administrative records linked to Census returns will in turn aid the interpretation of records
in the unmatched residuals.
Data matching and the results will be continuously updated as new data and information become available.
For example, Higher Education Statistics Agency (HESA) data relevant to the 2011 Census day only become
available in January 2012. Until that date, postcode data for 2010 will be used and will be updated with the
later release. Postcode data will be used because HESA data are not supplied with full addresses, only
postcode. We will be unable to use person-based HESA data before January 2012 because data currently
available to ONS relate to the 2009/10 academic year.
The data matching strategy uses data sources that are available to ONS at record level, either through data
sharing agreements with the respective data owners or because these data are generally available. Some data
(for example Council Tax data) are only available as aggregates, supplied directly by LAs, so their use is
restricted to aggregate-level comparison of addresses. Microdata that become newly available during data
matching will be integrated, as far as possible, into the methodology.
This paper does not consider microdata to be used in data matching that cover subsets of the population, such
as HESA data, School Census data and the DWP Migrant Worker Scan.
2. Methods
2.1. Methods for Census record matching
It is critical to build an understanding of the strengths and limitations of administrative sources, with
particular reference to comparability with Census definitions. Some key considerations:
• Population coverage. For example, the School Census captures important information for children in
state-funded schools, but not those in Independent schools or 16-18 year-olds in Further Education
colleges.
• Aligning reference periods between administrative sources and Census day. Time lags in the
administrative systems mean there is sometimes a considerable delay in being able to access data for
the relevant period. For example, HESA data for March 27th 2011 (Census day) will not be available
for analysis in ONS until January 2012.
• Definitional differences need to be identified and understood. Census uses questions on arrival in the
UK and intended length of stay to identify short-term migrants (present for 3-12 months), who will be
excluded from the ‘usually resident’ population. However, administrative sources do not collect this
information and may therefore appear inflated when compared to Census counts and estimates.
• Presence and completeness of data analysis and matching information. Variables such as name,
address, date of birth and sex need to be available, complete and of good quality to support data
matching.
• Identification of key population sub-groups. Administrative data do not always collect the
information necessary to identify and classify the sub-groups of particular relevance for Census QA.
For example, ethnicity information may not be available, or accurate (given proxy reporting), for
detailed ethnic comparisons.
The 2011 Census Data Matching Team investigated these issues through a series of data matching pilots4.
The matching pilots necessarily used data available at the time of this exploratory work, which was mainly
during late 2010. They also involved diverse geographic areas and while they provided very valuable insights
into each data pair, they did not create a coherent picture of administrative data quality patterns in different
types of local authority. To achieve this, a number of contrasting LAs were selected for case study
investigations, as described below.
4 Including matching of the following: Council Tax/ Valuation Office Agency, HESA/ 2009 Rehearsal, Address Register/
Rehearsal Census Coverage Survey addresses, Council Tax/ HESA, Council Tax/ Patient Register, Address Register/
Patient Register, Patient Register/ 2009 Rehearsal, School Census/ Rehearsal, 2009 Rehearsal Dummy Form/ Patient
Register.
2.1.1 Cleaning and data preparation
In addition to understanding coverage, quality and definitional issues, the administrative sources will be
processed ahead of matching, including de-duplication of records, checking variable formats, checks for
coding inconsistencies and checking the number of missing or unknown values for each variable. Addresses
may require cleaning. The Data Matching Team is using Matchcode5 for this.
Some data transformations may be necessary ahead of matching. String variables may need to be standardised
or parsed (split into their constituent parts). String variables may need to be concatenated to form a match
key. Unique identifiers will be added to each dataset for each record so that these, rather than identifying
information, will be stored alongside newly-created identifiers from other datasets for each matched record.
The matching process involves ‘blocking’ records, which amounts to selecting records as potential matches
based on a specified shared characteristic, such as address postcode. The quality of blocking variables and in
particular the levels of missingness in these fields requires careful consideration.
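The preparation steps described above (standardising strings, concatenating them into a match key and blocking on postcode) can be illustrated with a minimal Python sketch. The field names (`number`, `street`, `postcode`), the function names and the cleaning rules are assumptions made for illustration only; they do not describe the Data Matching Team's implementation, and Matchcode is not used here.

```python
import re
from collections import defaultdict


def standardise(text):
    """Upper-case a string, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", (text or "").upper())
    return " ".join(cleaned.split())


def standardise_postcode(text):
    """Standardise a postcode and drop internal spaces so formats agree."""
    return standardise(text).replace(" ", "")


def match_key(record):
    """Concatenate standardised address parts into one match key (hypothetical fields)."""
    parts = (
        standardise(record.get("number")),
        standardise(record.get("street")),
        standardise_postcode(record.get("postcode")),
    )
    return "|".join(parts)


def block_by_postcode(source_a, source_b):
    """Pair records from two sources that share a postcode.

    Returns (candidate_pairs, unblocked). Records with a missing postcode
    cannot be blocked, so they are returned separately to make the level of
    missingness in the blocking variable visible.
    """
    blocks = defaultdict(list)
    unblocked = []
    for rec in source_b:
        postcode = standardise_postcode(rec.get("postcode"))
        if postcode:
            blocks[postcode].append(rec)
        else:
            unblocked.append(rec)
    pairs = []
    for rec in source_a:
        postcode = standardise_postcode(rec.get("postcode"))
        if not postcode:
            unblocked.append(rec)
            continue
        pairs.extend((rec, other) for other in blocks.get(postcode, []))
    return pairs, unblocked
```

Each record in this sketch carries its own `id`, so that candidate pairs can be stored as pairs of identifiers rather than as identifying information.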
2.1.2 Record matching
A three-stage process of record matching will be applied, using exact matching followed by probabilistic or
score-based matching and finally a clerical search for record pairs. For both address and person matching,
exact matching involves searching within postcodes for identical house or flat number or name, together with
street name (for addresses) or first name, surname, gender and date of birth for individual record pairs.
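A minimal Python sketch of the exact matching stage for person records follows. It assumes hypothetical field names (`postcode`, `first_name`, `surname`, `sex`, `dob`) and a hypothetical rule that only one-to-one agreements on a complete key are accepted, with everything else left in the residuals for later stages. It is an illustration of the idea, not the routine used for the Census.

```python
from collections import defaultdict

KEY_FIELDS = ("postcode", "first_name", "surname", "sex", "dob")  # hypothetical names


def norm(value):
    """Upper-case a value and collapse whitespace; missing values become ''."""
    return " ".join(str(value or "").upper().split())


def person_key(record):
    """Build the exact-match key from the normalised matching variables."""
    return tuple(norm(record.get(field)) for field in KEY_FIELDS)


def exact_match(source_a, source_b):
    """Return (matches, residual_a, residual_b).

    A pair is accepted only when the key is complete and occurs exactly
    once in each source; all other records remain unmatched.
    """
    index_a, index_b = defaultdict(list), defaultdict(list)
    for rec in source_a:
        index_a[person_key(rec)].append(rec)
    for rec in source_b:
        index_b[person_key(rec)].append(rec)

    matches, residual_a, matched_b = [], [], set()
    for key, recs in index_a.items():
        candidates = index_b.get(key, [])
        if all(key) and len(recs) == 1 and len(candidates) == 1:
            matches.append((recs[0]["id"], candidates[0]["id"]))
            matched_b.add(candidates[0]["id"])
        else:
            residual_a.extend(recs)
    residual_b = [rec for rec in source_b if rec["id"] not in matched_b]
    return matches, residual_a, residual_b
```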
Unmatched records will be passed through a probabilistic or score-based matching process, again using record
blocking in the first instance. Candidate pairs that are matched through this process will be given a score
based on the level of agreement between the respective matching variables on each data source. Fellegi-Sunter
methods for probabilistic matching are being considered together with Jaro-Winkler bigrams6 and a
‘tokens’-based approach7 for string comparisons. Potential matches from this second stage will be referred for
clerical adjudication if they score above a predefined threshold.
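The scoring of candidate pairs can be illustrated with the standard Fellegi-Sunter agreement and disagreement weights. In the Python sketch below, the m- and u-probabilities, the threshold and the field names are invented for illustration and are not values used by the Data Matching Team. Simple equality stands in for a string comparator such as Jaro-Winkler.

```python
import math

# Hypothetical probabilities per field:
#   m = P(fields agree | records are a true match)
#   u = P(fields agree | records are not a match)
M_U = {
    "first_name": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "sex": (0.98, 0.5),
    "dob": (0.97, 0.001),
}


def field_weight(field, agrees):
    """Fellegi-Sunter log2 weight for agreement or disagreement on one field."""
    m, u = M_U[field]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_score(rec_a, rec_b):
    """Sum the field weights; a field missing on either side contributes nothing."""
    score = 0.0
    for field in M_U:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue
        score += field_weight(field, a == b)
    return score


def refer_for_clerical(rec_a, rec_b, threshold=10.0):
    """Flag a candidate pair for clerical adjudication (threshold is illustrative)."""
    return pair_score(rec_a, rec_b) > threshold
```

Fields with a low u-probability, such as date of birth, contribute most to the score when they agree, which is why agreement on sex alone carries little weight.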
Following probabilistic or score-based matching, unmatched records will be referred to a clerical matching
team for resolution. The expectation is that contextual information gained through looking at the relevant
Census form images will help to find matching record pairs.
Data quality is paramount. Falsely matched records will lead us to underestimate undercount while falsely
unmatched records will exaggerate the population missed by Census. The right balance between reducing
unmatched residuals and match quality will be maintained through stringent matching criteria in automatic
matching routines and precise instructions for clerical matchers. Quality checks will aim to ensure that type I
errors (false positives) do not exceed one per cent (subject to review) and will involve an expert matcher
checking a sample of record pairs for each clerical matcher. Given limited resources, this adherence to quality
standards means that record matching is always the last resort in the validation process.
5 Matchcode is an address management software product produced by Capscan. It compares and ‘cleans’ addresses using
the Royal Mail Postcode Address File (PAF). Addresses are matched against the PAF with varying, user-defined
flexibility around the degree of agreement. Testing and evaluation by the Census Data Matching Team has revealed that
postcode errors (rather than missing postcode information) and errors in the first elements of the address create most
address cleaning error when using Matchcode. The main problem is loss of detail, which impacts disproportionately on
sub-divided properties such as HMOs. Retention of original unclean addresses following cleaning is therefore
recommended.
6 Winkler W.E., Thibaudeau Y., An application of the Fellegi-Sunter model of record linkage to the 1990 US Decennial
Census (1991) US Bureau of the Census
7 Cohen W.W., Ravikumar P., Fienberg S.E., A comparison of String Distance Metrics for Name-Matching Tasks.
http://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf
Individuals with administrative records but no corresponding Census record will be searched for at addresses
associated by Census with their address in the administrative data. These are identified through the ‘Usual
address one year ago’, ‘Second residence’ and ‘Visitors’ usual residence’ questions.
2.1.3 Methods for the aggregate-level comparison of case study local authorities
To help understand and interpret administrative data (relationships between different sources and how these
vary geographically), aggregate-level distributions were compared for selected local authorities (LAs).
Geographic levels of aggregation considered are LA, CCS postcode clusters and postcode-level. Quinary (five-year)
age-sex groups of individual records are also compared.
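A comparison of quinary age-sex distributions between two sources can be sketched in Python as follows. The record layout (`sex`, `age`), the function names and the pooled top band of 90 and over are assumptions for illustration only.

```python
from collections import Counter


def quinary_group(age, top=90):
    """Return the five-year age band label; ages at or above `top` are pooled."""
    if age >= top:
        return f"{top}+"
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


def age_sex_counts(records):
    """Count records in each (sex, quinary age band) cell."""
    return Counter((rec["sex"], quinary_group(rec["age"])) for rec in records)


def compare_sources(counts_a, counts_b):
    """Difference (source A minus source B) for every cell present in either source."""
    cells = sorted(set(counts_a) | set(counts_b))
    return {cell: counts_a.get(cell, 0) - counts_b.get(cell, 0) for cell in cells}
```

Positive differences indicate cells where the first source holds more records than the second, for example an administrative source exceeding a population estimate for young men.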
The data matching focus on underenumeration led to the selection of four LAs where undercount is expected
to be among the highest and which are being processed early on in the Census programme. Two LAs were
selected as they also pose enumeration challenges, though not as severe as those in Inner London, and a further two
LAs were selected to represent easier to enumerate areas.
The analysis of address-based data in the case study LAs focused on areas used for the Census Coverage
Survey, given its importance for the accurate estimation of Census undercount. Data matching may be used
where Census population estimates are being queried. One approach will be to use record matching to bring
new evidence from administrative sources to bear on dual system estimation, used to correct for Census
underenumeration. Put simplistically, the Census estimates are based on a comparison of Census and Census
Coverage Survey (CCS) counts within selected clusters of postcodes in each LA. There are more CCS
postcode clusters in LAs considered hard to enumerate, and in turn the CCS sample postcodes are
concentrated in the more challenging areas. By identifying individuals unique to the Census and to the CCS
in these postcode clusters, the aim is to estimate the number (and characteristics) of those missed by both8.
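In its simplest form, dual system estimation reduces to the Lincoln-Petersen estimator. The Python sketch below assumes that capture by the Census and by the CCS are independent, and it ignores the stratification and adjustments applied in the actual Census coverage methodology. Function names are illustrative only.

```python
def dual_system_estimate(census_count, ccs_count, matched):
    """Lincoln-Petersen estimate of the true population in the sampled area.

    census_count: people counted by the Census
    ccs_count:    people counted by the Census Coverage Survey
    matched:      people found in both
    """
    if matched <= 0:
        raise ValueError("at least one matched record is needed")
    if matched > min(census_count, ccs_count):
        raise ValueError("matched cannot exceed either source count")
    return census_count * ccs_count / matched


def missed_by_both(census_count, ccs_count, matched):
    """Estimated number of people counted by neither the Census nor the CCS."""
    census_only = census_count - matched
    ccs_only = ccs_count - matched
    return census_only * ccs_only / matched
```

With 900 people in the Census, 800 in the CCS and 720 in both, the estimate is 900 x 800 / 720 = 1,000, of whom 980 were seen by at least one source and 20 are estimated to have been missed by both.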
Administrative data matching generates a matrix of linked information on addresses and the individuals living
in them that may bring new insights into usual residents in CCS postcode clusters. In addition, this linked
information can be used to create a proxy count of the number of occupied households, which also contributes
to the estimation of household sizes and structures in Census. The address-level analysis described in this
report therefore focuses in the first instance on address counts in administrative sources in CCS postcode
clusters, before being generalised to all postcodes within each LA.
2.2. Administrative sources used for the comparisons
This section describes key characteristics of the administrative sources used, together with notes on data
cleaning methods applied to them by the Data Matching Team.
2.2.1. Census Address Register
The Address Register underpins the 2011 Census. The register forms the post-out list for questionnaire
delivery, enables questionnaire tracking and targeted follow-up of non-responders, and is an important feed
into the quality assurance and estimation processes. The core of the Census address register is formed by a
match between the key national datasets: Royal Mail's Postcode Address File (PAF) and the National Land
and Property Gazetteer (NLPG), maintained by Local Government. The address register covers residential
and communal establishment addresses in England and Wales. The analysis described in this report used full
address files (household and communal establishment) from NAR14 (compiled November 2010 to be used for
printing questionnaires) incorporating updates from NAR16 (an updating file between November 2010 and
February 2011).
8 Abbott, O. (2009) ‘2011 UK Census Coverage Assessment and Adjustment Strategy’, Population Trends 137:25-32,
available at www.statistics.gov.uk/downloads/theme_population/PopTrends137web.pdf
2.2.2. Patient Register Data for England & Wales
The Patient Register contains records of patients registered with a GP who are resident in England and Wales.
The system is maintained by the National Health Service Central Register at Southport. Each record in the
register contains the patient’s NHS number, name, address (including postcode), date of birth and date of
acceptance by the Primary Care Trust (PCT). The entries on the register are updated on receipt of information
from PCTs.
The extract used in this analysis was current on 30 April 2010. A total of 21,959 records with duplicate NHS
numbers, out of 58,201,824, were removed from the file. Addresses were geo-referenced using Matchcode
software. Following this process, 26,941 addresses remained un-referenced. The Patient Register is a list of
individuals, most of whom share an address with others. To generate a list of addresses, the patient list was
de-duplicated using the first two address fields and postcode. Addresses that in reality are for the same
dwelling but have been recorded differently will appear at this stage as distinct addresses. This error is partly
resolved during data matching, when ‘one to many’ potential matches get resolved by clerical matchers. At
the aggregate level, the address list may remain artificially inflated. Some address list inflation in the Patient
Register will be counter-balanced by addresses missed because no-one living there is registered with a GP or
uses the NHS.
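The derivation of an address list from the patient list can be sketched in Python. The field names (`address1`, `address2`, `postcode`, `nhs_number`) and the normalisation rules are assumptions for illustration, not the processing actually applied to the Patient Register extract.

```python
import re


def norm(value):
    """Upper-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", str(value or "").upper())
    return " ".join(cleaned.split())


def address_key(record):
    """De-duplication key: first two address fields plus postcode."""
    return (
        norm(record.get("address1")),
        norm(record.get("address2")),
        norm(record.get("postcode")).replace(" ", ""),
    )


def address_list(patients):
    """Group patients by address key; each key is one address in the derived list."""
    addresses = {}
    for rec in patients:
        addresses.setdefault(address_key(rec), []).append(rec["nhs_number"])
    return addresses
```

Two patients recorded at "Flat 2 / 10 High Street" collapse to one address under this key, but a third recorded at "10 High Street / Flat 2" produces a second, distinct address for the same dwelling. This is the inflation described above, which only record matching and clerical resolution can remove.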
2.2.3. Customer Information System Data (DWP/HMRC) for England & Wales
The CIS was introduced by DWP to replace the Departmental Central Index (DCI) and Personal Details
Computer System (PDCS) to become the central repository of personal details for DWP and parts of HMRC.
It is essentially an interface between benefits systems, HMRC registration and other sources (e.g. DVLA) and
stores personal/biographic information and holds records for approximately 91m individuals. In its role as
DWP/HMRC master repository and providing other government departments with data on benefit
confirmations, the CIS is a highly performant system.
Information on CIS is updated by way of an overnight feed from source systems. Therefore the currency of
information held for individuals within CIS is very much dependent on interactions with source systems.
Statistical quality analysis of the CIS suggests that address information is likely to be up to date for pensions
but may not be updated for some time for some individuals, or may depend on employers feeding through
changes to address information reported to them on HMRC systems. This is explored below. Data are not
'weeded' or purged - details for every individual allocated a National Insurance number (NINo) are retained indefinitely, although records
do have a death marker. There is no flag if an individual moves abroad or dies abroad.
The type of information held by CIS includes: personal details such as names, addresses, gender, date of birth
and date of death, ethnicity and language and type of benefit (start & end date of claim). In-migrants are
identified within the CIS as having a NINo allocated at a non-standard age (adult allocations).
The CIS extract used in this analysis contains 55.1 million records and includes both ‘active’ and ‘inactive’
records. Inactive records are those where there is no date of death but there has been no benefit or
employment activity record in the last three years. Record-level data are not currently available in ONS.
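The definition of an inactive record can be expressed as a simple rule. The Python sketch below uses hypothetical field names (`date_of_death`, `last_activity`) and treats records with a date of death as a separate category, which is an assumption; record-level CIS data were not available, so this illustrates the definition rather than any processing that was carried out.

```python
from datetime import date


def years_before(day, years):
    """Return the date `years` years before `day` (29 February maps to 28 February)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def cis_status(record, reference_date, inactivity_years=3):
    """Classify a record as 'deceased', 'inactive' or 'active'.

    Inactive: no date of death, and no benefit or employment activity
    in the `inactivity_years` years before the reference date.
    """
    if record.get("date_of_death") is not None:
        return "deceased"
    last_activity = record.get("last_activity")
    cutoff = years_before(reference_date, inactivity_years)
    if last_activity is None or last_activity < cutoff:
        return "inactive"
    return "active"
```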
Figures 2 and 3 show that the CIS data for England and Wales are broadly consistent with Patient Register
data at the national level. Both the CIS and Patient Register counts exceed mid-year estimates for prime age
groups of both sexes in England and for males in Wales. This pattern of broadly consistent counts/ estimates
is not reflected in case study LAs, and suggests that further analysis is required to understand the use of
geographic information in the Patient Register and CIS for statistical purposes.
10
Figure 2 Comparison of Customer Information System, Patient Register and mid-year estimates for England.
[Chart: age (vertical axis) against counts of males and females (horizontal axis); series: MYE, PR and CIS.]
Figure 3 Comparison of Customer Information System, Patient Register and mid-year estimates for Wales.
[Chart: age (vertical axis) against counts of males and females (horizontal axis); series: MYE, PR and CIS.]
Figure 4 below shows time elapsed since an individual’s record was last updated in the Patient Register and
the CIS data within the sample of Hard to Count LAs. Analysis of time elapsed since last update (interaction)
provides a useful indicator of data currency. Figure 4 suggests that Patient Register data are more current than
CIS data. It should be noted that the CIS, due to its role, holds biographic data on individuals (no records are
ever deleted from the system) whereas Patient Registers are purged on a regular basis. Over half of CIS
records have not been updated for at least five years, compared to around a quarter of Patient Register records.
However, it is unclear from this analysis whether those records not updated for more than five years are a
reflection of a stable population or a reflection that individuals have not interacted with the system, possibly
because they are abroad.
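The currency indicator discussed here, the share of records not updated for at least a given number of years, can be computed as in the following Python sketch. The function names and the use of whole elapsed years are assumptions for illustration.

```python
from datetime import date


def whole_years_elapsed(last_update, reference_date):
    """Number of complete years between the last update and the reference date."""
    years = reference_date.year - last_update.year
    if (reference_date.month, reference_date.day) < (last_update.month, last_update.day):
        years -= 1
    return years


def share_not_updated(last_updates, reference_date, years=5):
    """Proportion of records whose last update is at least `years` whole years old."""
    if not last_updates:
        raise ValueError("no records supplied")
    stale = sum(
        1 for day in last_updates if whole_years_elapsed(day, reference_date) >= years
    )
    return stale / len(last_updates)
```

As the text notes, a high share on this measure does not by itself distinguish a stable population from people who have left without de-registering.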
Figure 4 Time elapsed since last update, ‘Hard to count’ Local Authorities.
2.2.4. Electoral Registers
The electoral register is a record of the names and addresses of all people eligible to vote in elections held in
the United Kingdom. These people are listed at the address where they are currently resident and registered.
Each local authority produces its own list which is updated annually through postal canvasses.
Entitlement to be on the electoral register comes from the entitlement to vote. The right to vote in the UK
extends to all adult UK, Irish and Commonwealth Citizens who are usually resident in the UK. There are a
small number of exceptions, for example convicted prisoners. In some cases people may choose not to register
themselves. Citizens of the EU member states resident in the UK are entitled to vote in local and European
Parliament elections and therefore will be included on the register. The electoral register also includes 16 and
17 year olds who will turn 18 during the period in which the register is in force.
The variables and data quality vary between local authorities, although all registers contain names and
addresses of electors.
2.2.5. Valuation Office Agency Council Tax Data for England & Wales
The Valuation Office Agency (VOA) maintains valuation lists of addresses in England and Wales for the
purposes of council tax bandings and non domestic rates. To maintain accurate valuation lists, the VOA has
always needed to keep the data it holds about a property up to date. This is a key part of the VOA’s
responsibilities. The valuation lists are updated every week from local authorities (VOA staff input changes on
a daily basis). The valuation lists cover all domestic properties and composite dwellings which are used for
both domestic and business use, but exclude business only properties. Council tax exemption and discount
code information is not included within VOA data. VOA data were not available in time for incorporation in
this report.
2.2.6. Local Authority Council Tax Data
The Census QA team holds low aggregate extracts of council tax data for a number of local authorities who
have responded to ONS’ request for this information. A request was made in January 2011 to all local
authorities for counts of dwellings from Council Tax data (including exemption and discount codes) at unit
postcode level (or higher if unavailable). The reference date for these data was 27 March 2011.
Council Tax Discounts are available for single person households, students, empty homes, second homes,
among others. Exemptions are available for vacant and unoccupied dwellings and, for example, student halls
and armed forces’ accommodation.
As these data were compiled by different LAs it is likely that there will be a degree of inconsistency between
them. For example, it is unclear whether all LAs have included counts of properties that are ‘long-term
empty’. A further quality issue that will affect all LAs to varying degrees is that properties will be double-counted
if they attract more than one discount or exemption code. Counts also depend on households claiming the
discount, which may vary between LAs.
2.3 The CCS sample in the eight case-study LAs
The eight LAs were selected for case study as they pose contrasting challenges for Census enumeration. This
is reflected in the composition of their Census Coverage Survey sample, which requires fuller description.
The CCS sample uses all addresses within around half of the postcodes within Output Areas that are stratified
by LA and by a national Hard to Count (HtC) Index. The HtC Index is constructed using a combination of
variables including:
• The proportion of people claiming Income Support or Job Seekers’ Allowance
• The proportion of people who are not ‘White-British’
• A measure of the relative house price in each LA
• A measure of dwelling density
Sample selection within the LA/HtC strata depends on a number of factors, including 2001 Census coverage
patterns and levels and LA size. There is a minimum of one Output Area per LA HtC stratum and a maximum
of 60 OAs within an LA (with fewer than a handful of deliberate exceptions), to protect against 2001 patterns
over-determining 2011 undercount adjustment. Thus the HtC distribution of output areas within an LA
broadly reflects its general composition. Index values range between 1 and 5, with 5 being the hardest to
count areas. The national distribution of the HtC Index is 40, 40, 10, 8 and 2 per cent (easiest first). The four
hard-to-count case-study LAs had CCS sample output areas in HtC groups 4 and 5. The easy-to-count
LAs had CCS sample output areas in HtC groups 1 and 2, while the medium-to-count LAs spanned HtC
groups 1 to 4.
3. Findings from the LA case studies: addresses
Here we present selected findings from the analysis of administrative data for eight case-study LAs.
3.1 Counts of addresses in administrative data sources
The number of addresses in the Patient Register was derived through de-duplication as described in Section 2.
Unless all duplicates were identified, the list of Patient Register addresses will be inflated. However, Patient
Register address counts should be lower than Council Tax and Address Register counts as they exclude
addresses where residents (or visitors) are not registered with the NHS. Likewise, Electoral Register address
counts should be lower than Council Tax and Address Register counts as they exclude addresses with no
registered electors.
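As a minimal sketch of why incomplete de-duplication inflates an address count, the Python below normalises address strings before counting distinct values. The normalisation rules, the `normalise_address` helper and the sample records are illustrative assumptions, not the procedure described in Section 2.

```python
import re

def normalise_address(raw: str) -> str:
    """Crude normalisation: upper-case, drop punctuation, collapse spaces.

    Hypothetical rules for illustration only.
    """
    text = raw.upper()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space
    text = re.sub(r"\bFLAT\b", "FLT", text)   # one example abbreviation rule
    return re.sub(r"\s+", " ", text).strip()

def count_addresses(records):
    """Return (distinct raw strings, distinct normalised strings)."""
    raw = len(set(records))
    deduped = len({normalise_address(r) for r in records})
    return raw, deduped

# Invented sample: three spellings of one address plus one other address.
sample = [
    "Flat 2, 10 High Street, AB1 2CD",
    "FLAT 2 10 HIGH STREET AB1 2CD",
    "Flt 2, 10 High St., AB1 2CD",
    "12 High Street, AB1 2CD",
]
raw_count, deduped_count = count_addresses(sample)
```

In this invented sample there are two real addresses, four distinct raw strings and three distinct normalised strings: the "St." spelling is not caught by the rules, so the de-duplicated count is still inflated by one.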
LA-provided Council Tax data were only available for five of the eight LAs and the analysis presented here
includes all addresses, including those receiving discounts and exemptions.
It is important to note the different reference dates for each source. Differences between sources will in part
reflect real changes over time, particularly in areas with more residential development.
Figure 5
Range of address counts in administrative data for case study Local Authorities
[Chart. Vertical axis: counts, 0 to 160,000. Local Authorities: ETC LA1, ETC LA2, HTC LA1, HTC LA2, HTC LA3, HTC LA4, MTC LA1, MTC LA2. Series: AR (Address Register), CT (Council Tax), PR (Patient Register), ER (Electoral Register).]
Figure 5 shows the range of address counts in each source in the eight case study LAs.
The Address Register and Council Tax data include both occupied and unoccupied addresses. The
Patient Register and Electoral Register data should include only addresses with patients and electors,
respectively. Time lags in de-registering and re-registering at a new address blur the distinction to some
degree.
The two sources that agree most closely are the Patient Register and Council Tax, where available. The
Electoral Register has fewest addresses in every LA, reflecting that not all dwellings contain people eligible
and registered to vote in elections. The Address Register has consistently the highest count of addresses,
reflecting the conservatism of its creation. The difference between the Address Register and Electoral
Register is highest in HTC LA1, with 64 per cent more addresses on the Address Register here. Updates to
the Address Register from the Questionnaire Tracking system will ‘deactivate’ addresses and should result in
greater convergence between address-based sources.
Agreement between the sources is highest in ETC LA1 and MTC LA1, as shown in Figure 5. There is least
agreement in HTC LA1, followed by HTC LA2. Arguably the Electoral Register is artificially inflating the
address count range, since it does not have universal coverage. Removing the Electoral Register from the
range leaves HTC LA1 and HTC LA2 with the biggest differences between sources and ETC LA2 with the
smallest, but re-ranks the other LAs.
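The effect of dropping one source from the range can be sketched as follows. The counts are invented (chosen only so that the Address Register is 64 per cent above the Electoral Register), and `percent_excess` and `relative_range` are hypothetical helper names rather than part of the analysis reported here.

```python
def percent_excess(higher: float, lower: float) -> float:
    """Percentage by which `higher` exceeds `lower`."""
    return 100.0 * (higher - lower) / lower

def relative_range(counts: dict, exclude=()) -> float:
    """Range (max - min) as a percentage of the lowest count, ignoring `exclude`."""
    kept = [v for k, v in counts.items() if k not in exclude]
    return percent_excess(max(kept), min(kept))

# Invented address counts for a single LA.
example_la = {"AR": 82000, "CT": 70000, "PR": 72000, "ER": 50000}

with_er = relative_range(example_la)                      # all four sources
without_er = relative_range(example_la, exclude=("ER",))  # Electoral Register removed
```

With these illustrative figures the range falls from 64 per cent to roughly 17 per cent once the Electoral Register is excluded, which is why the ranking of LAs can change.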
3.2 Address-level data: hard- and easy-to-count LAs compared
Figure 6 compares address counts in the Address, Patient and Electoral Registers within CCS postcode
clusters in a hard-to-count LA. Data were ranked in order of the number of addresses in the Address Register.
The Address Register contains more addresses than either of the other sources in the majority of CCS
postcode clusters. Mid-way through the Census field operation we re-ran this analysis, omitting addresses
that were deactivated by collectors (for example because they were non-existent or duplicates). The gap
between the Address Register and other sources approximately halved as a result of deactivations. Excess
address counts in the Patient Register are due to incomplete address matching, i.e. they are likely to contain
some element of duplication.
This analysis allowed us to identify postcodes where the divergences between sources were the greatest. One
notable excess Electoral Register count was a complete sequence of numbers in a block of flats. Further
investigation would reveal if this was an Address Register omission or an empty block, for example being
refurbished. Conversely, two examples where administrative data counts were substantially lower than
Address Register counts reflected new developments captured by the Address Register, but not yet inhabited.
These examples illustrate the value of low-level aggregate address analysis for identifying and explaining data
discrepancies.
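A minimal sketch of this kind of screening, assuming address counts have already been aggregated by postcode cluster and source, is given below. The cluster names and counts are invented, and the two helper functions are hypothetical.

```python
def rank_by_divergence(clusters: dict) -> list:
    """Sort postcode clusters by the spread (max - min) of their source counts."""
    spreads = {
        name: max(counts.values()) - min(counts.values())
        for name, counts in clusters.items()
    }
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)

def excess_over_address_register(clusters: dict, source: str) -> list:
    """Clusters where `source` has more addresses than the Address Register."""
    return [
        name for name, counts in clusters.items()
        if counts[source] > counts["AR"]
    ]

# Invented counts per cluster for three sources.
clusters = {
    "cluster_a": {"AR": 120, "PR": 95, "ER": 80},
    "cluster_b": {"AR": 60, "PR": 58, "ER": 75},
    "cluster_c": {"AR": 150, "PR": 40, "ER": 35},
}
ranked = rank_by_divergence(clusters)
er_excess = excess_over_address_register(clusters, "ER")
```

Here `cluster_c` resembles an uninhabited new development (Address Register far above the other sources), while `cluster_b` is flagged because the Electoral Register exceeds the Address Register.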
Figure 6
Address count comparison within Census Coverage Survey postcode clusters
Hard to count local authority
[Chart. Vertical axis: counts, 0 to 200. Series: AR, PR, CT, ER.]
Figure 7 compares address counts in the Address, Patient and Electoral Registers within CCS postcode
clusters in an easy-to-count LA. Since the easy-to-count areas pose relatively few enumeration challenges,
they have a smaller CCS sample size and fewer postcode clusters. The address counts within the three sources
are in close agreement, with the Address Register consistently the highest.
Figure 7
Address count comparison within Census Coverage Survey postcode clusters
Easy to count local authority
[Chart. Vertical axis: counts, 0 to 200. Series: AR, PR, ER.]
4. Findings from the LA case studies: individuals
At the individual level, Patient Register and DWP Customer Information System data were compared to ONS
Mid-year population estimates (MYEs). Figure 8 summarises the range of person counts in each LA.
Differences between sources will in part reflect real changes over time, particularly in areas with more mobile
populations such as Inner London.
There tends to be closer agreement between the sources on the person count than there was for addresses,
though the divergence for addresses is attenuated when the Electoral Register is removed from the address
count analysis. HTC LA4 provides a stark exception, with 1.80 times as many records on the CIS as on the
Patient Register.
A striking characteristic of this comparison is that the relationship between CIS counts and the Patient
Register is inconsistent. CIS counts are higher in HTC LA2, HTC LA4, MTC LA1 and MTC LA2 and lower
in ETC LA1, ETC LA2, HTC LA1 and HTC LA3. There is no obvious explanation for this pattern, which is
explored in more detail in the LA profiles that follow. However, as noted previously the CIS, due to its role,
holds biographic data on individuals (no records are ever deleted from the system) whereas Patient Registers
are purged on a regular basis and therefore individuals may be removed from Patient Registers.
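The direction and size of the CIS to Patient Register difference can be tabulated per LA with a few lines of Python. The LA labels and counts below are invented (the first pair is chosen to give a ratio of 1.80), and `compare_cis_to_pr` is a hypothetical helper.

```python
def compare_cis_to_pr(counts: dict) -> dict:
    """Map each LA to (CIS/PR ratio rounded to 2 d.p., direction of difference)."""
    result = {}
    for la, (cis, pr) in counts.items():
        if cis > pr:
            direction = "higher"
        elif cis < pr:
            direction = "lower"
        else:
            direction = "equal"
        result[la] = (round(cis / pr, 2), direction)
    return result

# Invented (CIS, PR) person counts.
person_counts = {
    "la_x": (360000, 200000),
    "la_y": (95000, 100000),
}
summary = compare_cis_to_pr(person_counts)
```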
Figure 8
Range of person counts in administrative data for case study Local Authorities
[Chart. Vertical axis: counts, 0 to 450,000. Local Authorities: ETC LA1, ETC LA2, HTC LA1, HTC LA2, HTC LA3, HTC LA4, MTC LA1, MTC LA2. Series: PR, CIS, MYE.]
4.1 Individual-level data: hard- and easy-to-count LAs compared
Figure 9 compares the Patient Register, DWP Customer Information System and ONS Mid-year estimates by
age and sex in a hard to count LA. The CIS counts exceed Patient Register and MYEs at all ages over around
28 years. The slight bulge around pension age in the Patient Register and MYEs is accentuated in the CIS.
Further data exploration is required to understand this variance.
The CIS has fewer under-18s than the Patient Register, with around half as many infants. Tentative
explanations are low levels of child benefit claims, and under-18s who have entered the country as
migrants but have had no need to register for a National Insurance number.
There is consistency between sources in the counts of both males and females aged approximately 18 to 30,
though this could to some degree incorporate similar patterns of list inflation in both the Patient Register and
CIS data. Data matching would reveal the extent to which these corresponding aggregates represent the same
people. There is a steep gradient reflecting grouping and possible inflation of over-84s.
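To illustrate why matching is needed, the sketch below links two small invented extracts on an exact key. The choice of key (surname, date of birth, postcode), the field names and the records are all assumptions made for illustration; linkage of real administrative sources would need more careful cleaning and matching rules than this.

```python
def match_key(record: dict) -> tuple:
    """Hypothetical match key: normalised surname, date of birth, postcode."""
    return (
        record["surname"].strip().upper(),
        record["dob"],
        record["postcode"].replace(" ", "").upper(),
    )

def link_sources(source_a: list, source_b: list) -> dict:
    """Count keys found in both sources and the unmatched residual in each."""
    keys_a = {match_key(r) for r in source_a}
    keys_b = {match_key(r) for r in source_b}
    return {
        "matched": len(keys_a & keys_b),
        "only_a": len(keys_a - keys_b),
        "only_b": len(keys_b - keys_a),
    }

# Invented records: both extracts contain two people.
patient_register = [
    {"surname": "Smith", "dob": "1985-03-02", "postcode": "AB1 2CD"},
    {"surname": "Jones", "dob": "1990-11-17", "postcode": "AB1 2CE"},
]
cis = [
    {"surname": "SMITH ", "dob": "1985-03-02", "postcode": "ab12cd"},
    {"surname": "Patel", "dob": "1988-07-30", "postcode": "AB1 2CF"},
]
result = link_sources(patient_register, cis)
```

Both invented extracts have the same aggregate count, yet only one record links, leaving one unmatched residual in each source. Equal aggregates therefore do not show that two sources describe the same people.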
Figure 9
Single year age-sex distribution using Patient Register, Customer Information System data and Mid-year Estimates counts
Hard to count local authority
[Chart. Panels: Males, Females. Age axis: 0 to 80. Count axis: 0 to 5,000 in each panel. Series: MYE, CIS, PR.]
Surprisingly, there was inconsistency in the person counts for the easy-to-count LA shown in Figure 10. The
counts for the CIS were lower than those for the Patient Register and MYE at all ages. Further investigation is
required to understand these differences.
Figure 10
Single year age-sex distribution using Patient Register, Customer Information System data and Mid-year Estimates counts
Easy to count local authority
[Chart. Panels: Males, Females. Age axis: 0 to 80. Count axis: 0 to 5,000 in each panel. Series: MYE, CIS, PR.]
5. Conclusions, recommendations and next steps
5.1. Conclusions
We begin by drawing together the main findings from the LA case studies, distinguishing between
address-based and individual-level data. For addresses, the range of counts in the different administrative
sources was highest in hard-to-count and London LAs. This reflects high levels of population flux in these areas and the
likelihood that administrative sources are not recording out-migration in a timely way. The divergence
between counts was lowest in rural areas, reflecting greater population stability in these areas. The Address
Register was typically higher than other address-based sources, though address deactivations as a result of
Census field operations are reducing the difference. The Electoral Register typically has the lowest address
counts, because it excludes addresses where there are no eligible electors. This makes any excess count in the
Electoral Register compared with the Address Register particularly useful for further investigation. The
Patient Register address counts used in the analysis described here were inflated because of incomplete
address matching within the Register.
For individuals, there was again most divergence between administrative sources in hard-to-count LAs,
particularly London. There was less divergence in easy-to-count and rural areas, with differences
concentrated among young adults. However, there are inconsistent patterns of difference between the
individual-level administrative datasets, which require further investigation.
Data preparation and matching between administrative sources in anticipation of the data QA process will
soon give way to real data anomalies identified by the Census Quality Team and referred to the Data Matching
Team for further investigation. This matching, particularly within CCS postcode clusters where Census
information is most complete, will allow us to assess:
• Whether aggregate-level comparisons reflect the same individuals
• Whether residual administrative data records from matching can be explained through Census
‘associated addresses’
• The extent to which Patient Register, School Census and HESA matching can validate or
improve the Census population estimates.