GSS Methodology Symposium, 6 July 2011

Abstract

Title: Linkage of administrative sources to validate Census population estimates

This paper describes how record-level administrative data are being used to help validate Census population estimates. The innovative data matching approach described here is just one component of a comprehensive data quality assurance strategy for the 2011 Census. The focus here is the validation of Census estimates of undercount. A matrix of linked information on addresses and the people living in them, drawn from administrative data, can shed new light on households and individuals that Census processes may have missed. The administrative sources will be described, together with the methods being used to clean and link them. This paper also describes how new Census information on second residences and visitors, together with addresses one year ago, will support the interpretation of linked records and of unmatched residuals. These results and methods draw on a series of data matching pilots designed to strengthen our understanding of available administrative sources and their quality [1].

Keywords: Census, Administrative Sources, record matching

Louisa Blackwell, Louisa.blackwell@ons.gsi.gov.uk, 0132944 4539
Nicky Rogers, Nicola.rogers@ons.gsi.gov.uk, 0132944 4866

[1] Quality of administrative data sources was assessed in terms of their statistical quality and their use to validate or improve the Census population estimates.

1. Introduction and background

1.1. Introduction

This paper is concerned with the use of administrative data sources to validate 2011 Census population estimates. Administrative data will be used as comparators for Census counts and estimates, for detailed investigations of data discrepancies at low-level geographies and for record-level matching. Record matching will only be invoked where there are discrepancies between Census population estimates and comparator datasets that cannot be explained at the aggregate level.
The 2011 Census data quality assurance (QA) strategy and QA Methodology [2] sets out a wide range of checks and data comparisons that will ensure that Census estimates at local authority, regional and national level are accurate. These are summarised in the ‘Background’ section below. We then describe the administrative data to be used in the QA process. A critical consideration in the statistical use of administrative data is that this information was primarily collected for other purposes. Data preparation for aggregate-level comparison and for data matching involves data cleaning as well as standardisation and synchronisation, which we describe in the ‘Methods’ section below. Synchronising Census and administrative sources can be challenging, given the presence of source-specific time lags between an individual’s interaction with the administrative system and the availability of that information for analysis at ONS.

[2] Office for National Statistics, 2011 Census Data Quality Assurance Strategy, June 2009.

The 2011 Census Data Matching Team has conducted a series of case studies to investigate how the statistical quality issues associated with using administrative sources to validate Census population estimates vary in different types of local authority (LA). We present some examples from these case studies, as they highlight both the utility of administrative sources for probing data discrepancies and their limitations, particularly in areas that have high population churn. In examining administrative data for the case study LAs, we asked:

1. To what extent do the counts of addresses vary between the Address Register, Council Tax, the Patient Register, Electoral Register and Valuation Office Agency data?
2. Do these patterns vary in different types of local authority?
3. How does the coverage of administrative sources in Inner London compare with urban and rural areas with less population flux?
4. What can the aggregate-level comparisons tell us about the data matching strategy to be followed?
5. What data quality issues can be tackled ahead of data quality assurance and data matching?
6. Are some sources better at counting the number and characteristics of particular age-sex groups than others?

The results of this analysis have shaped the approach taken to record-level matching. Conversely, record matching will shed further light on these results. For example, the only way to check that consistent address or person counts in different sources refer to the same entities is to match the individual records and evaluate the unmatched residuals.

Resource constraints meant that this analysis was restricted to eight local authorities. Data matching will only be used in local authorities that pose the greatest enumeration challenge. To capture the salient characteristics of administrative sources in this type of area, and to gain early substantive insights into LAs likely to require supplementary analysis, four of the eight selected LAs are in Inner London. Two LAs that pose less extreme, but still distinctive, enumeration challenges were also included. By contrast, two of the least challenging LAs were also analysed.

Address-based data comparisons focus on data discrepancies in Census Coverage Survey (CCS) postcode clusters and in all postcodes within each LA. CCS postcode clusters are the focus because, when a population estimate for a particular LA is challenged and the issue(s) cannot be resolved with reference to aggregate-level data, the data matching team will consider whether administrative sources can shed new light on the areas within the LA that were used for dual system estimation. In CCS postcodes we also have most information on Census-day populations.

Person-level data were disaggregated by age and sex so that the representativeness of the administrative sources for different population sub-groups, and for males and females within these, could be explored.
There are gendered differences in interactions with the NHS Register; for example, young women are more likely to register with and consult GPs than young men. This could create an undercount of young men in the Patient Register. Customer Information System (CIS) address data for women bringing up children could very easily be more up to date than that for men if benefit or tax credits are being claimed. Such patterns may be visible, particularly when compared against ONS’ mid-year population estimates.

1.2. Background

Administrative data will be used at varying levels of aggregation to quality assure Census population estimates. An overview of the Census data quality assurance strategy is provided in Wroth-Smith et al (2011) [3]. This is summarised in Figure 1. The quality assurance process is organised into three stages: core checks on the data, supplementary checks and, for data anomalies that cannot be understood using aggregate-level data, record matching. Core and supplementary checks will be carried out at local authority and lower levels of geography. There will also be regional and national-level checks, using data comparators and demographic indicators, to ensure that the emerging and final regional and national estimates are accurate. Core and supplementary analysis may include a focus on population sub-groups that pose particular enumeration challenges, such as migrants or students, or on groups that the checking process has highlighted as under- or over-represented.

[3] Wroth-Smith, J., Abbott, O., Compton, G. and Benton, P. (2011) ‘Quality Assuring the 2011 Census Population Estimates’, Population Trends 143: 1:9.

Figure 1 Quality Assurance Overview

It is efficient to compare Census and administrative data at the aggregate level because data aggregates can be anonymised and comparisons can be quickly automated to deliver easy-to-interpret results. However, there could be aggregate-level discrepancies between two or more sources that cannot be explained.
The differences could reflect a localised breakdown in Census field or estimation processes. To understand and reconcile such differences, new insights may be gained by matching records for addresses and individuals within a particular area. Matching administrative sources to Census data will:

- Identify addresses missed or misclassified by Census
- Inform the alternative household count, used for coverage estimation
- Assess dual system estimates through comparisons of address and person counts in CCS postcode clusters
- Validate address and person counts in small geographic pockets with high Census non-response
- Identify people that Census missed within counted households
- Check counts and estimates for key population sub-groups (such as babies, school children, higher education students and immigrants)

Data matching will only be undertaken within the secure and legally compliant business systems and processes that have been built for the 2011 Census.

1.3 Guiding principles for record matching

The Census data matching strategy has been developed to focus on underenumeration, that is, households and individuals that may have been missed by Census processes. Other issues such as Census overcount are dealt with in other strands of Census estimation and the quality assurance strategy. In addition, newly available 2011 Census information on date of arrival, intended length of stay, second residences and visitors (and their usual addresses), together with respondents’ usual address one year ago, will be used to help reconcile unmatched residuals from administrative data.

Interpretation of the results from data matching will be an iterative and empirically-driven process. The Address Register and its validation, Census enumeration, the Census Coverage Survey and Census data processing provide a uniquely rich snapshot of people and places in March 2011 against which administrative data can be compared. Through matching, knowledge will be gained incrementally.
For example, address matching ahead of Census will structure and simplify individual-level matching of Census response data. The characteristics of administrative records linked to Census returns will in turn aid the interpretation of records in the unmatched residuals.

Data matching and the results will be continuously updated as new data and information become available. For example, Higher Education Statistics Agency (HESA) data relevant to the 2011 Census day only become available in January 2012. Until that date, postcode data for 2010 will be used and will be updated with the later release. Postcode data will be used because HESA data are not supplied with full addresses, only postcode. We will be unable to use person-based HESA data before January 2012 because data currently available to ONS relate to the 2009/10 academic year.

The data matching strategy uses data sources that are available to ONS at record level, either through data sharing agreements with the respective data owners or because these data are generally available. Some data (for example Council Tax data) are only available as aggregates, supplied directly by LAs, so their use is restricted to aggregate-level comparison of addresses. Microdata that become newly available during data matching will be integrated, as far as possible, into the methodology. This paper does not consider microdata to be used in data matching that cover subsets of the population, such as HESA data, School Census data and the DWP Migrant Worker Scan.

2. Methods

2.1. Methods for Census record matching

It is critical to build an understanding of the strengths and limitations of administrative sources, with particular reference to comparability with Census definitions. Some key considerations:

- Population coverage. For example, the School Census captures important information for children in state-funded schools, but not those in Independent schools or 16-18 year-olds in Further Education colleges.
- Aligning reference periods between administrative sources and Census day. Time lags in the administrative systems mean there is sometimes a considerable delay in being able to access data for the relevant period. For example, HESA data for 27 March 2011 (Census day) will not be available for analysis in ONS until January 2012.
- Definitional differences need to be identified and understood. Census uses questions on arrival in the UK and intended length of stay to identify short-term migrants (present for 3-12 months), who will be excluded from the ‘usually resident’ population. However, administrative sources do not collect this information and may therefore appear inflated when compared to Census counts and estimates.
- Presence and completeness of data analysis and matching information. Variables such as name, address, date of birth and sex need to be available, complete and of good quality to support data matching.
- Identification of key population sub-groups. Administrative data do not always collect the information necessary to identify and classify the sub-groups of particular relevance for Census QA. For example, ethnicity information may not be available, or accurate (given proxy reporting), for detailed ethnic comparisons.

The 2011 Census Data Matching Team investigated these issues through a series of data matching pilots [4]. The matching pilots necessarily used data available at the time of this exploratory work, which was mainly during late 2010. They also involved diverse geographic areas, and while they provided very valuable insights into each data pair, they did not create a coherent picture of administrative data quality patterns in different types of local authority. To achieve this, a number of contrasting LAs were selected for case study investigations, as described below.
[4] Including matching of the following: Council Tax/Valuation Office Agency, HESA/2009 Rehearsal, Address Register/Rehearsal Census Coverage Survey addresses, Council Tax/HESA, Council Tax/Patient Register, Address Register/Patient Register, Patient Register/2009 Rehearsal, School Census/Rehearsal, 2009 Rehearsal Dummy Form/Patient Register.

2.1.1 Cleaning and data preparation

In addition to understanding coverage, quality and definitional issues, the administrative sources will be processed ahead of matching, including de-duplication of records, checking variable formats, checks for coding inconsistencies and checking the number of missing or unknown values for each variable. Addresses may require cleaning. The Data Matching Team is using Matchcode [5] for this.

Some data transformations may be necessary ahead of matching. String variables may need to be standardised or parsed (split into their constituent parts). String variables may need to be concatenated to form a match key. A unique identifier will be added to every record in each dataset so that, for each matched record, these identifiers (rather than identifying information) are stored alongside the newly-created identifiers from other datasets.

The matching process involves ‘blocking’ records, which amounts to selecting records as potential matches based on a specified shared characteristic, such as address postcode. The quality of blocking variables, and in particular the levels of missingness in these fields, requires careful consideration.

2.1.2 Record matching

A three-stage process of record matching will be applied, using exact matching followed by probabilistic or score-based matching and finally a clerical search for record pairs.

For both address and person matching, exact matching involves searching within postcodes for identical house or flat number or name, together with street name (for addresses), or for identical first name, surname, gender and date of birth (for individual record pairs).
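The preparation and exact-matching steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the Data Matching Team's implementation: the field names (`id`, `first_name`, `surname`, `sex`, `dob`, `postcode`) and the helper functions are hypothetical, and the standardisation rule (upper-casing and collapsing whitespace) is an assumption.

```python
from collections import defaultdict

# Hypothetical field names; real extracts will differ by source.
PERSON_FIELDS = ("first_name", "surname", "sex", "dob")


def standardise(value):
    """Upper-case and collapse whitespace (an assumed standardisation rule)."""
    return " ".join(str(value).upper().split())


def blocking_key(record):
    """Postcode with spaces removed; None when the postcode is missing."""
    postcode = record.get("postcode")
    return standardise(postcode).replace(" ", "") if postcode else None


def match_key(record):
    """Concatenate the standardised matching variables into a single key."""
    return "|".join(standardise(record[field]) for field in PERSON_FIELDS)


def exact_match(source_a, source_b):
    """Return (matched id pairs, unmatched ids from source_a)."""
    # Index source_b by (block, match key) so comparisons stay within postcodes.
    index = defaultdict(list)
    for record in source_b:
        block = blocking_key(record)
        if block is not None:
            index[(block, match_key(record))].append(record["id"])

    matched, residual = [], []
    for record in source_a:
        block = blocking_key(record)
        candidates = index.get((block, match_key(record)), []) if block else []
        if len(candidates) == 1:
            # Only the two record identifiers are stored, not names or addresses.
            matched.append((record["id"], candidates[0]))
        else:
            # No candidate, or more than one: leave for the later stages.
            residual.append(record["id"])
    return matched, residual
```

Records with a missing postcode cannot be blocked and fall straight into the residual, which is one reason why missingness in blocking variables matters.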
Unmatched records will be passed through a probabilistic or score-based matching process, again using record blocking in the first instance. Candidate pairs that are matched through this process will be given a score based on the level of agreement between the respective matching variables on each data source. Fellegi-Sunter methods for probabilistic matching are being considered together with Jaro-Winkler bigrams [6] and a ‘tokens’-based approach [7] for string comparisons. Potential matches from this second stage will be referred for clerical adjudication if they score above a predefined threshold.

Following probabilistic or score-based matching, unmatched records will be referred to a clerical matching team for resolution. The expectation is that contextual information gained through looking at the relevant Census form images will help to find matching record pairs.

Data quality is paramount. Falsely matched records will lead us to underestimate undercount, while falsely unmatched records will exaggerate the population missed by Census. The right balance between reducing unmatched residuals and match quality will be maintained through stringent matching criteria in automatic matching routines and precise instructions for clerical matchers. Quality checks will aim to ensure that type I errors (false positives) do not exceed one per cent (subject to review) and will involve an expert matcher checking a sample of record pairs for each clerical matcher. Given limited resources, this adherence to quality standards means that record matching is always the last resort in the validation process.

Individuals with administrative records but no corresponding Census record will be searched for at addresses associated by Census with their address in the administrative data. These are identified through the ‘Usual address one year ago’, ‘Second residence’ and ‘Visitors’ usual residence’ questions.

[5] Matchcode is an address management software product produced by Capscan. It compares and ‘cleans’ addresses using the Royal Mail Postcode Address File (PAF). Addresses are matched against the PAF with varying, user-defined flexibility around the degree of agreement. Testing and evaluation by the Census Data Matching Team has revealed that postcode errors (rather than missing postcode information) and errors in the first elements of the address create most address cleaning error when using Matchcode. The main problem is loss of detail, which impacts disproportionately on sub-divided properties such as HMOs. Retention of original unclean addresses following cleaning is therefore recommended.

[6] Winkler, W. E. and Thibaudeau, Y. (1991) An application of the Fellegi-Sunter model of record linkage to the 1990 US Decennial Census, US Bureau of the Census.

[7] Cohen, W. W., Ravikumar, P. and Fienberg, S. E., A comparison of String Distance Metrics for Name-Matching Tasks. http://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf

2.1.3 Methods for the aggregate-level comparison of case study local authorities

To help understand and interpret administrative data (the relationships between different sources and how these vary geographically), aggregate-level distributions were compared for selected local authorities (LAs). The geographic levels of aggregation considered are LA, CCS postcode cluster and postcode. Quinary age-sex groups of individual records are also compared.

The data matching focus on underenumeration led to the selection of four LAs where undercount is expected to be among the highest and which are being processed early on in the Census programme. Two LAs were selected as they also pose enumeration challenges, though not as severe as Inner London, and a further two LAs were selected to represent easier-to-enumerate areas.
The analysis of address-based data in the case study LAs focused on areas used for the Census Coverage Survey, given its importance for the accurate estimation of Census undercount. Data matching may be used where Census population estimates are being queried. One approach will be to use record matching to bring new evidence from administrative sources to bear on dual system estimation, which is used to correct for Census underenumeration. Put simply, the Census estimates are based on a comparison of Census and Census Coverage Survey (CCS) counts within selected clusters of postcodes in each LA. There are more CCS postcode clusters in LAs considered hard to enumerate, and in turn the CCS sample postcodes are concentrated in the more challenging areas. By identifying individuals unique to the Census and to the CCS in these postcode clusters, the aim is to estimate the number (and characteristics) of those missed by both [8]. Administrative data matching generates a matrix of linked information on addresses and the individuals living in them that may bring new insights into usual residents in CCS postcode clusters. In addition, this linked information can be used to create a proxy count of the number of occupied households, which also contributes to the estimation of household sizes and structures in Census.

The address-level analysis described in this report therefore focuses in the first instance on address counts in administrative sources in CCS postcode clusters, before being generalised to all postcodes within each LA.

2.2. Administrative sources used for the comparisons

This section describes key characteristics of the administrative sources used, together with notes on data cleaning methods applied to them by the Data Matching Team.

2.2.1. Census Address Register

The Address Register underpins the 2011 Census.
The register forms the post-out list for questionnaire delivery, enables questionnaire tracking and targeted follow-up of non-responders, and is an important feed into the quality assurance and estimation processes. The core of the Census address register is formed by a match between the key national datasets: Royal Mail's Postcode Address File (PAF) and the National Land and Property Gazetteer (NLPG), maintained by Local Government. The address register covers residential and communal establishment addresses in England and Wales. The analysis described in this report used full address files (household and communal establishment) from NAR14 (compiled November 2010 to be used for printing questionnaires), incorporating updates from NAR16 (an updating file between November 2010 and February 2011).

[8] Abbott, O. (2009) ‘2011 UK Census Coverage Assessment and Adjustment Strategy’, Population Trends 137: 25-32, available at www.statistics.gov.uk/downloads/theme_population/PopTrends137web.pdf

2.2.2. Patient Register Data for England & Wales

The Patient Register contains records of patients registered with a GP who are resident in England and Wales. The system is maintained by the National Health Service Central Register at Southport. Each record in the register contains the patient’s NHS number, name, address (including postcode), date of birth and date of acceptance by the Primary Care Trust (PCT). The entries on the register are updated on receipt of information from PCTs. The extract used in this analysis was current on 30 April 2010. A total of 21,959 records with duplicate NHS numbers, out of 58,201,824, were removed from the file. Addresses were geo-referenced using Matchcode software. Following this process, 26,941 addresses remained un-referenced.

The Patient Register is a list of individuals, most of whom share an address with others. To generate a list of addresses, the patient list was de-duplicated using the first two address fields and postcode.
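As a rough illustration of this step, the sketch below derives an address list from person records by de-duplicating on the first two address fields and postcode. The field names (`addr1`, `addr2`, `postcode`) are hypothetical, and the light normalisation (trimming, upper-casing, removing spaces from the postcode) is an assumption rather than a description of the processing actually applied.

```python
def address_list(patients):
    """De-duplicate person records to addresses on (address line 1, address line 2, postcode)."""
    seen = set()
    addresses = []
    for patient in patients:
        key = (
            patient["addr1"].strip().upper(),
            patient["addr2"].strip().upper(),
            patient["postcode"].replace(" ", "").upper(),
        )
        if key not in seen:
            seen.add(key)
            addresses.append(key)
    return addresses


# Invented records: the first two share an address; the third is the same
# dwelling written differently, so it survives as a separate address.
patients = [
    {"addr1": "Flat 1", "addr2": "10 High Street", "postcode": "AB1 2CD"},
    {"addr1": "flat 1 ", "addr2": "10 High Street", "postcode": "ab12cd"},
    {"addr1": "10 High St", "addr2": "Flat 1", "postcode": "AB1 2CD"},
]
unique_addresses = address_list(patients)
```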
Addresses that in reality are for the same dwelling but have been recorded differently will appear at this stage as distinct addresses. This error is partly resolved during data matching, when ‘one to many’ potential matches are resolved by clerical matchers. At the aggregate level, the address list may remain artificially inflated. Some address list inflation in the Patient Register will be counter-balanced by addresses missed because no-one living there is registered with a GP or uses the NHS.

2.2.3. Customer Information System Data (DWP/HMRC) for England & Wales

The Customer Information System (CIS) was introduced by DWP to replace the Departmental Central Index (DCI) and Personal Details Computer System (PDCS) and to become the central repository of personal details for DWP and parts of HMRC. It is essentially an interface between benefits systems, HMRC registration and other sources (e.g. DVLA). It stores personal/biographic information and holds records for approximately 91m individuals. In its role as DWP/HMRC master repository, providing other government departments with data on benefit confirmations, the CIS is a highly performant system.

Information on CIS is updated by way of an overnight feed from source systems. The currency of information held for individuals within CIS is therefore very much dependent on interactions with source systems. Statistical quality analysis of the CIS suggests that address information is likely to be up to date for pensions, but for some individuals it may not be updated for some time, or it depends on employers feeding through address changes reported to them on HMRC systems. This is explored below. Data are not 'weeded' or purged: details for every individual allocated a National Insurance number (NINo) are retained indefinitely, although records do have a death marker. There is no flag if an individual moves abroad or dies abroad.
The information held by CIS includes personal details such as names, addresses, gender, date of birth and date of death, ethnicity and language, and type of benefit (start and end date of claim). In-migrants are identified within the CIS as having a NINo allocated at a non-standard age (adult allocations). The CIS extract used in this analysis contains 55.1 million records and includes both ‘active’ and ‘inactive’ records. Inactive records are those where there is no date of death but there has been no benefit or employment activity record in the last three years. Record-level data are not currently available in ONS.

Figures 2 and 3 show that the CIS data for England and Wales are broadly consistent with Patient Register data at the national level. Both the CIS and Patient Register counts exceed mid-year estimates for prime age groups of both sexes in England and for males in Wales. This pattern of broadly consistent counts/estimates is not reflected in the case study LAs, which suggests that further analysis is required to understand the use of geographic information in the Patient Register and CIS for statistical purposes.

Figure 2 Comparison of Customer Information System, Patient Register and mid-year estimates for England.
[Population pyramid for England: age on the vertical axis, male and female counts on the horizontal axis; series MYE, PR and CIS.]

Figure 3 Comparison of Customer Information System, Patient Register and mid-year estimates for Wales.
[Population pyramid for Wales: age on the vertical axis, male and female counts on the horizontal axis; series MYE, PR and CIS.]

Figure 4 below shows time elapsed since an individual’s record was last updated in the Patient Register and the CIS data within the sample of Hard to Count LAs. Analysis of time elapsed since last update (interaction) provides a useful indicator of data currency. Figure 4 suggests that Patient Register data are more current than CIS data.
It should be noted that the CIS, due to its role, holds biographic data on individuals (no records are ever deleted from the system) whereas Patient Registers are purged on a regular basis. Over half of CIS records have not been updated for at least five years, compared to around a quarter of Patient Register records. However, it is unclear from this analysis whether records not updated for more than five years reflect a stable population or individuals who have not interacted with the system, possibly because they are abroad.

Figure 4 Time elapsed since last update, ‘Hard to count’ Local Authorities.

2.2.4. Electoral Registers

The electoral register is a record of the names and addresses of all people eligible to vote in elections held in the United Kingdom. These people are listed at the address where they are currently resident and registered. Each local authority produces its own list, which is updated annually through postal canvasses. Entitlement to be on the electoral register comes from the entitlement to vote. The right to vote in the UK extends to all adult UK, Irish and Commonwealth Citizens who are usually resident in the UK. There are a small number of exceptions, for example convicted prisoners. In some cases people may choose not to register themselves. Citizens of the EU member states resident in the UK are entitled to vote in local and European Parliament elections and therefore will be included on the register. The electoral register also includes 16 and 17 year olds who will turn 18 during the period in which the register is in force. The variables and data quality vary between local authorities, although all registers contain names and addresses of electors.

2.2.5. Valuation Office Agency Council Tax Data for England & Wales

The Valuation Office Agency (VOA) maintains valuation lists of addresses in England and Wales for the purposes of council tax bandings and non-domestic rates.
To maintain accurate valuation lists, the VOA has always needed to keep the data it holds about a property up to date. This is a key part of the VOA’s responsibilities. The valuation lists are updated every week from local authorities (VOA staff input changes on a daily basis). The valuation lists cover all domestic properties and composite dwellings which are used for both domestic and business use, but exclude business-only properties. Council tax exemption and discount code information is not included within VOA data. VOA data were not available in time for incorporation in this report.

2.2.6. Local Authority Council Tax Data

The Census QA team holds low-level aggregate extracts of council tax data for a number of local authorities who have responded to ONS’ request for this information. A request was made in January 2011 to all local authorities for counts of dwellings from Council Tax data (including exemption and discount codes) at unit postcode level (or higher if unavailable). The reference date for these data was 27 March 2011. Council Tax discounts are available for single person households, students, empty homes and second homes, among others. Exemptions are available for vacant and unoccupied dwellings and, for example, student halls and armed forces’ accommodation.

As these data were compiled by different LAs, it is likely that there will be a degree of inconsistency between them. For example, it is unclear whether all LAs have included counts of properties that are ‘long-term empty’. A further quality issue that will affect all LAs to varying degrees is that properties will be double-counted if they attract more than one discount or exemption code. This in turn depends on households claiming the discount, which may vary between LAs.

2.3 The CCS sample in the eight case-study LAs

The eight LAs were selected for case study as they pose contrasting challenges for Census enumeration.
This is reflected in the composition of their Census Coverage Survey sample, which requires fuller description. The CCS sample uses all addresses within around half of the postcodes within Output Areas that are stratified by LA and by a national Hard to Count (HtC) Index. The HtC Index is constructed using a combination of variables including:

- The proportion of people claiming Income Support or Job Seekers’ Allowance
- The proportion of people who are not ‘White-British’
- A measure of the relative house price in each LA
- A measure of dwelling density

Sample selection within the LA/HtC strata depends on a number of factors, including 2001 Census coverage patterns and levels and LA size. There is a minimum of one Output Area per LA HtC stratum and a maximum of 60 OAs within an LA (with fewer than a handful of deliberate exceptions), to protect against 2001 patterns over-determining 2011 undercount adjustment. Thus the HtC distribution of output areas within an LA broadly reflects its general composition. Index values range between 1 and 5, with 5 being the hardest to count areas. The national distribution of the HtC Index is 40, 40, 10, 8 and 2 per cent (easiest first). The four hard-to-count LAs for case study had CCS sample output areas in HtC groups 4 and 5. The easy-to-count LAs had CCS sample output areas in HtC groups 1 and 2, while the medium-to-count LAs spanned HtC groups 1 to 4.

3. Findings from the LA case studies: addresses

Here we present selected findings from the analysis of administrative data for eight case-study LAs.

3.1 Counts of addresses in administrative data sources

The number of addresses in the Patient Register was derived through de-duplication as described in Section 2. Unless all duplicates were identified, the list of Patient Register addresses will be inflated.
However, Patient Register address counts should be lower than Council Tax and Address Register counts, as they exclude addresses where residents (or visitors) are not registered with the NHS. Likewise, Electoral Register address counts should be lower than Council Tax and Address Register counts, as they exclude addresses with no registered electors. LA-provided Council Tax data were only available for five of the eight LAs, and the analysis presented here covers all addresses, including those receiving discounts and exemptions. It is important to note the different reference dates for each source. Differences between sources will in part reflect real changes over time, particularly in areas with more residential development.

Figure 5 Range of address counts in administrative data for case study Local Authorities
[Chart not reproduced. Vertical axis: counts (0 to 160,000). Horizontal axis: ETC LA1, ETC LA2, HTC LA1, HTC LA2, HTC LA3, HTC LA4, MTC LA1, MTC LA2. Series: AR, CT, PR, ER.]

Figure 5 shows the range of address counts in each source in the eight case study LAs. The Address Register and Council Tax data both include occupied and unoccupied addresses. The Patient Register and Electoral Register should only include addresses with patients and electors, respectively. Time lags in de-registering and re-registering at a new address blur the distinction to some degree. The two sources that agree most closely are the Patient Register and Council Tax, where available. The Electoral Register has the fewest addresses in every LA, reflecting the fact that not all dwellings contain people who are eligible and registered to vote in elections. The Address Register consistently has the highest count of addresses, reflecting the conservative approach taken in its creation. The difference between the Address Register and Electoral Register is highest in HTC LA1, with 64 per cent more addresses on the Address Register here.
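The range and percentage comparisons described above can be sketched in a few lines of Python. This is an illustration only: the function names and the counts for the example LA are hypothetical and are not taken from Figure 5. The counts were simply chosen so that the Address Register exceeds the Electoral Register by 64 per cent.

```python
from typing import Dict, Iterable


def count_range(counts: Dict[str, int], exclude: Iterable[str] = ()) -> int:
    """Return the highest minus the lowest address count, ignoring excluded sources."""
    excluded = set(exclude)
    kept = [n for source, n in counts.items() if source not in excluded]
    return max(kept) - min(kept)


def percent_excess(higher: int, lower: int) -> float:
    """Return the percentage by which `higher` exceeds `lower`."""
    return 100.0 * (higher - lower) / lower


# Hypothetical address counts for a single LA (invented for illustration).
example_la = {"AR": 82_000, "CT": 74_000, "PR": 73_000, "ER": 50_000}

full_range = count_range(example_la)                        # all four sources
range_without_er = count_range(example_la, exclude=["ER"])  # ER removed
ar_over_er = percent_excess(example_la["AR"], example_la["ER"])
```

Computing the range with and without the Electoral Register makes it easy to see how far that source, which lacks universal coverage, drives the apparent disagreement between sources.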
Updates to the Address Register from the Questionnaire Tracking system will ‘deactivate’ addresses and should result in greater convergence between address-based sources. Agreement between the sources is highest in ETC LA1 and MTC LA1, as shown in Figure 5. There is least agreement in HTC LA1, followed by HTC LA2. Arguably the Electoral Register is artificially inflating the address count range, since it does not have universal coverage. Removing the Electoral Register from the range leaves HTC LA1 and HTC LA2 with the biggest differences between sources and ETC LA2 with the smallest, but re-ranks the other LAs.

3.2 Address-level data: hard- and easy-to-count LAs compared

Figure 6 compares address counts in the Address, Patient and Electoral Registers within CCS postcode clusters in a hard-to-count LA. Data were ranked in order of the number of addresses in the Address Register. The Address Register contains more addresses than either of the other sources in the majority of CCS postcode clusters. Mid-way through the Census field operation we re-ran this analysis, omitting addresses that were deactivated by collectors (for example because they were non-existent or duplicates). The gap between the Address Register and other sources approximately halved as a result of deactivations. Excess address counts in the Patient Register are likely to reflect incomplete address matching, i.e. they probably contain some element of duplication. This analysis allowed us to identify postcodes where the divergences between sources were the greatest. One notable excess Electoral Register count was a complete sequence of numbers in a block of flats. Further investigation would reveal whether this was an Address Register omission or an empty block, for example one being refurbished. Conversely, two examples where administrative data counts were substantially lower than Address Register counts reflected new developments captured by the Address Register, but not yet inhabited.
These examples illustrate the value of low-level aggregate address analysis for identifying and explaining data discrepancies.

Figure 6 Address count comparison within Census Coverage Survey postcode clusters: hard-to-count local authority
[Chart not reproduced. Vertical axis: counts (0 to 200). Series: AR, PR, CT, ER.]

Figure 7 compares address counts in the Address, Patient and Electoral Registers within CCS postcode clusters in an easy-to-count LA. Since the easy-to-count areas pose relatively few enumeration challenges, they have a smaller CCS sample size and fewer postcode clusters. The address counts within the three sources are in close agreement, with the Address Register consistently the highest.

Figure 7 Address count comparison within Census Coverage Survey postcode clusters: easy-to-count local authority
[Chart not reproduced. Vertical axis: counts (0 to 200). Series: AR, PR, ER.]

4. Findings from the LA case studies: individuals

At the individual level, Patient Register and DWP Customer Information System (CIS) data were compared to ONS Mid-year population estimates (MYEs). Figure 8 summarises the range of person counts in each LA. Differences between sources will in part reflect real changes over time, particularly in areas with more mobile populations such as Inner London. There tends to be closer agreement between the sources on the person count than there was for addresses, though the divergence for addresses is reduced once the Electoral Register is removed from the address count analysis. HTC LA4 provides a stark exception, with 1.80 times as many records on the CIS as on the Patient Register. A striking characteristic of this comparison is that the relationship between CIS counts and the Patient Register is inconsistent. CIS counts are higher in HTC LA2, HTC LA4, MTC LA1 and MTC LA2 and lower in ETC LA1, ETC LA2, HTC LA1 and HTC LA3. There is no obvious explanation for this pattern, which is explored in more detail in the LA profiles that follow.
However, as noted previously, the CIS, because of its role, holds biographic data on individuals and no records are ever deleted from the system, whereas Patient Registers are purged on a regular basis, so individuals may be removed from them.

Figure 8 Range of person counts in administrative data for case study Local Authorities
[Chart not reproduced. Vertical axis: counts (0 to 450,000). Horizontal axis: ETC LA1, ETC LA2, HTC LA1, HTC LA2, HTC LA3, HTC LA4, MTC LA1, MTC LA2. Series: PR, CIS, MYE.]

4.1 Individual-level data: hard- and easy-to-count LAs compared

Figure 9 compares the Patient Register, DWP Customer Information System and ONS Mid-year estimates by age and sex in a hard-to-count LA. The CIS counts exceed the Patient Register and MYEs at all ages over around 28 years. The slight bulge around pension age in the Patient Register and MYEs is accentuated in the CIS. Further data exploration is required to understand this variance. The CIS has fewer under-18s than the Patient Register, with around half as many infants. A tentative explanation would be low levels of Child Benefit claims, or under-18s who have entered the country as migrants but have had no need to register for a National Insurance number. There is consistency between sources in the counts of both males and females aged approximately 18 to 30, though this could to some degree incorporate similar patterns of list inflation in both the Patient Register and CIS data. Data matching would reveal the extent to which these corresponding aggregates represent the same people. There is a steep gradient at the oldest ages, reflecting the grouping, and possible inflation, of over-84s.
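The age-specific comparison of CIS and Patient Register counts can be illustrated with a short Python sketch. It is not the method used for Figure 9: the age bands, the function names and all of the counts below are hypothetical, chosen only to show how single-year-of-age counts might be grouped (with the oldest ages top-coded at 85 and over) and expressed as a CIS-to-Patient-Register ratio.

```python
from collections import defaultdict
from typing import Dict

# Illustrative age bands (label, lowest age, highest age); an assumption,
# not the grouping used in the paper. The last band top-codes ages 85+.
AGE_BANDS = [("0-17", 0, 17), ("18-30", 18, 30), ("31-84", 31, 84), ("85+", 85, 200)]


def band_for(age: int) -> str:
    """Return the label of the age band containing `age`."""
    for label, lowest, highest in AGE_BANDS:
        if lowest <= age <= highest:
            return label
    raise ValueError(f"age out of range: {age}")


def ratio_by_band(cis: Dict[int, int], pr: Dict[int, int]) -> Dict[str, float]:
    """Sum single-year counts into bands and return CIS / Patient Register ratios."""
    cis_totals: Dict[str, int] = defaultdict(int)
    pr_totals: Dict[str, int] = defaultdict(int)
    for age, n in cis.items():
        cis_totals[band_for(age)] += n
    for age, n in pr.items():
        pr_totals[band_for(age)] += n
    # Bands with no Patient Register records are skipped to avoid dividing by zero.
    return {
        label: cis_totals[label] / pr_totals[label]
        for label, _, _ in AGE_BANDS
        if pr_totals[label] > 0
    }


# Hypothetical single-year-of-age counts (invented for illustration).
cis_counts = {0: 500, 10: 900, 25: 2000, 40: 3600, 85: 300}
pr_counts = {0: 1000, 10: 1000, 25: 2000, 40: 3000, 85: 200}

ratios = ratio_by_band(cis_counts, pr_counts)
```

A ratio below 1 in the youngest band and above 1 at older ages would correspond to the pattern described for the hard-to-count LA, but only record-level matching can show whether similar aggregates represent the same people.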
Figure 9 Single year age-sex distribution using Patient Register, Customer Information System data and Mid-year Estimates counts: hard-to-count local authority
[Population pyramid not reproduced. Vertical axis: age (0 to 80). Horizontal axis: counts of males and females (0 to 5,000 on each side). Series: MYE, CIS, PR.]

Surprisingly, there was inconsistency in the person counts for the easy-to-count LA shown in Figure 10. The counts for the CIS were lower than those for the Patient Register and MYE at all ages. Further investigation is required to understand these differences.

Figure 10 Single year age-sex distribution using Patient Register, Customer Information System data and Mid-year Estimates counts: easy-to-count local authority
[Population pyramid not reproduced. Vertical axis: age (0 to 80). Horizontal axis: counts of males and females (0 to 5,000 on each side). Series: MYE, CIS, PR.]

5. Conclusions, recommendations and next steps

5.1. Conclusions

We begin by drawing together the main findings from the LA case studies, distinguishing between address-based and individual-level data. For addresses, the range of counts in the different administrative sources was highest in hard-to-count and London LAs. This reflects high levels of population flux in these areas and the likelihood that administrative sources are not recording out-migration in a timely way. The divergence between counts was lowest in rural areas, reflecting greater population stability in these areas. Address Register counts were typically higher than those in other address-based sources, though address deactivations as a result of Census field operations are reducing the difference. The Electoral Register typically has the lowest address counts, because it excludes addresses where there are no eligible electors. This makes any excess count in the Electoral Register compared with the Address Register particularly useful for further investigation.
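As a minimal sketch of how such excess counts might be flagged for follow-up, the Python below lists postcode clusters where the Electoral Register count exceeds the Address Register count, largest excess first. The `ClusterCounts` structure, the function name and the cluster counts are all hypothetical and are not drawn from the case-study data.

```python
from typing import List, NamedTuple


class ClusterCounts(NamedTuple):
    """Address counts for one postcode cluster (hypothetical structure)."""
    cluster: str
    ar: int  # Address Register
    pr: int  # Patient Register
    er: int  # Electoral Register


def electoral_register_excess(clusters: List[ClusterCounts]) -> List[ClusterCounts]:
    """Return clusters where ER exceeds AR, ordered by the size of the excess."""
    flagged = [c for c in clusters if c.er > c.ar]
    return sorted(flagged, key=lambda c: c.er - c.ar, reverse=True)


# Hypothetical counts for three postcode clusters (invented for illustration).
clusters = [
    ClusterCounts("A", ar=120, pr=110, er=90),
    ClusterCounts("B", ar=40, pr=45, er=64),
    ClusterCounts("C", ar=80, pr=70, er=82),
]

for c in electoral_register_excess(clusters):
    print(f"cluster {c.cluster}: ER exceeds AR by {c.er - c.ar} addresses")
```

Ordering by the size of the excess puts cases such as a whole block of flats missing from the Address Register at the top of the list for investigation.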
The Patient Register address counts used in the analysis described here were inflated because of incomplete address matching within the Register.

For individuals, there was again most divergence between administrative sources in hard-to-count LAs, particularly London. There was less divergence in easy-to-count and rural areas, with differences concentrated among young adults. However, the inconsistent patterns of difference between the individual-level administrative datasets require further investigation.

Data preparation and matching between administrative sources in anticipation of the data QA process will soon give way to the investigation of real data anomalies identified by the Census Quality Team and referred to the Data Matching Team. This matching, particularly within CCS postcode clusters where Census information is most complete, will allow us to assess: whether aggregate-level comparisons reflect the same individuals; whether residual administrative data records from matching can be explained through Census ‘associated addresses’; and the extent to which Patient Register, School Census and HESA matching can validate or improve the Census population estimates.
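The logic of explaining matching residuals through associated addresses can be sketched as follows. This is a simplified illustration under stated assumptions, not the matching method used by the Data Matching Team: records are reduced to a hypothetical (person key, address key) pair, exact equality stands in for whatever linkage rules are actually applied, and the example records are invented.

```python
from typing import Dict, Iterable, Set, Tuple

# A record is reduced to (person key, address key); both keys are hypothetical
# stand-ins for the cleaned and standardised matching variables.
Record = Tuple[str, str]


def classify_admin_records(
    admin: Iterable[Record],
    census_usual: Set[Record],
    census_associated: Set[Record],
) -> Dict[Record, str]:
    """Label each administrative record as matched, explained or residual.

    matched   : same person at the same address as a Census usual residence
    explained : person linked to that address as a Census associated address
                (e.g. second residence, visitor address, address one year ago)
    residual  : no Census link found at that address
    """
    labels: Dict[Record, str] = {}
    for record in admin:
        if record in census_usual:
            labels[record] = "matched"
        elif record in census_associated:
            labels[record] = "explained"
        else:
            labels[record] = "residual"
    return labels


# Invented example records.
admin_records = [("p1", "addr1"), ("p2", "addr2"), ("p3", "addr3")]
usual_residences = {("p1", "addr1"), ("p2", "addr9")}
associated_addresses = {("p2", "addr2")}  # e.g. p2's address one year ago

labels = classify_admin_records(admin_records, usual_residences, associated_addresses)
```

In this sketch, the records left as residual after both checks are the ones that would remain to be investigated as possible Census undercount or administrative list inflation.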