Date: 21 November 2014
Theme: Population

Semaphore Research Update

Summary

The Border Force (BF), part of the Home Office, is responsible for the Border Systems Portfolio (BSP). The primary aim of the BSP is to improve UK border security by collecting information on those travelling into, and out of, the UK. Data are largely collected by the Semaphore system and are referred to as Semaphore data throughout this report.

Why are ONS interested in Semaphore?

The Office for National Statistics (ONS) have a continuous improvement culture for all aspects of statistical production from data collection to dissemination. Semaphore data are not currently used within population and migration statistics and ONS are interested in whether the data can improve methods of measuring migration. Although the primary objective of the BSP is border security, there is a commitment within the BSP to help ONS use the data to measure migration.

Readers of this report should note the distinction between statistical and operational quality. This consideration is important when using any administrative data for statistical purposes. BF’s operational requirements may be very different to ONS’ statistical requirements, and features of the data that present statistical challenges for ONS may not affect operations.

Can Semaphore data improve migration statistics?

The UK Statistics Authority and government have previously stated that Semaphore cannot be used to directly measure international migration in its current form. To confirm this, ONS has assessed the statistical quality of an extract of Semaphore data (covering 2009 to 2012), and we have worked with BF to understand how improvements they have made since the end of 2012 may have affected statistical quality. Based on this evidence, ONS agrees that Semaphore cannot be used to directly measure migration.
ONS has been able to make use of the 2009-2012 Semaphore data in some areas, for example as an additional source of reference data for International Passenger Survey (IPS) Sample Review work. We have also been able to give a fresh assessment of the other potential ways in which Semaphore data could be used to improve migration statistics if and when statistical data quality improves, for example in combination with other administrative data sources.

What will ONS do next?

ONS will continue research on Semaphore data to ensure potential benefits are explored as soon as data are of sufficient statistical quality. This work will include receipt and analysis of more extracts of Semaphore data from BF.

Most importantly, ONS have benefited from working closely with BF and have built a clear understanding of the data systems, while BF now better understand the statistical requirements for Semaphore data. ONS will continue to strengthen this relationship and make sure statistical benefits are considered alongside operational benefits as BF improve the Semaphore system further. Regular updates on this research will be published.

ONS is committed to improving population and migration statistics and Semaphore is just one area being investigated. The ONS Population and Statistics Research Unit (PSRU) continue to explore other possibilities and regular updates on these are published on the ONS website.

Acknowledgement

We would like to thank colleagues at Border Force (BF) and Home Office for their assistance in supplying Semaphore data as well as their ongoing collaboration and engagement with the project.

1. Introduction

1.1 What is Semaphore?

The Border Force (BF), part of the Home Office, is responsible for the Border Systems Portfolio (BSP). The primary aim of the BSP is to improve UK border security by collecting information on those travelling into, and out of, the UK.
Data are largely collected by the Semaphore system and will be referred to as Semaphore data in this report. These naming conventions replace the terms ‘eBorders system’ and ‘e-borders data’ previously used.

The aim of the BSP is to improve UK border security by collecting information electronically on those travelling into, and out of, the UK. Operators that carry international passengers to, or from, the UK by air, rail and sea are required to provide passenger, crew and service details to BF at certain stages prior to travel. This information enables BF, the police, and HM Revenue and Customs to deploy resources to deal effectively and efficiently with illegal immigration, crime and threats to UK security, whilst minimising the impact on legitimate travellers. Although the primary objective of the BSP is border security, there is a commitment within the BSP to help the Office for National Statistics (ONS) use the data to measure migration.

The Semaphore system electronically collects passengers’ and crew members’ travel document information in advance of travel either into, or out of, the UK (a travel document most often refers to a passport but could relate to any other admissible identity document such as an EU Identity Card). The information collected is known as Advance Passenger Information (API) and consists of the details that are contained within the machine readable zone (MRZ) of a travel document. It is mandatory for carriers to collect API and to supply it to BF when requested. Details collected include:
- name
- date of birth
- nationality
- sex
- travel document type
- country of issue
- document number
- issue date
- expiry date

The API is analysed against a series of watch lists of people of interest to security and border agencies. The results of that analysis enable the security and border agencies to identify and target individuals of interest before they arrive in, or exit, the UK or to make subsequent interventions as necessary.
In addition to API, carriers must also supply service information which includes details such as:
- flight number
- flight times
- name of carrier
- departure/arrival port

The Semaphore system database is defined as a ‘transactional’ administrative data source: each travel event is a ‘transaction’ and appears as a new entry on the system. The information is passed into the Semaphore system at two points (as a minimum) during the travel process:
- at check-in (this could be an individual checking in online)
- at departure, either via electronic swiping or manual input of the document information

1.2 Why are ONS interested in Semaphore data?

ONS have a continuous improvement culture for all aspects of statistical production from data collection to dissemination. For population and migration statistics, Semaphore data are not currently used and ONS are interested in whether the data can improve methods of measuring international migration. This report updates users on progress of this work since our last user update in June 2014.

A 2012 ONS report ‘Delivering the statistical benefits from e-Borders’ outlined the potential ways in which Semaphore data could improve population and migration statistics. The report was largely theoretical because of limitations in the Semaphore data available to ONS at that time. The report concluded:

1. Semaphore data could potentially be used to:
- produce direct counts of long term and short term migration
- make improvements to the International Passenger Survey (IPS)
- deliver other statistical benefits, for example passenger travel statistics
2. Semaphore data could not in its current form completely replace the IPS because the IPS collects information that is not collected by the Semaphore system (for example reason for migration)
3. Further research was required, including assessment of a larger Semaphore data extract as coverage increased

The 2012 report proposed that if Semaphore data provided near complete coverage and the data were of high statistical quality, then it may be possible to produce travel histories that link details of an individual’s passenger movements in and out of the UK over time. By analysing these movements, it would then be possible to infer whether or not an individual was ‘usually resident’ in the UK and whether their residence status had changed. For example, an individual would be classified as a long-term immigrant if they:
- are recorded entering the UK
- have no record of being in the UK in the previous 12 months; and
- have no record of leaving the UK in the 12 months following their recorded entry

Users of migration statistics are interested in this possibility, and in July 2013, a Public Administration Select Committee (PASC) review [1] into migration statistics included a recommendation that ‘ONS and Home Office should move as quickly as possible to measuring immigration, emigration and net migration using [Semaphore]’. However, this is not straightforward. Both the UK Statistics Authority [2] and the government [3] have since responded highlighting limitations of the Semaphore data, and that it cannot be used to measure migration directly.

A key limitation is that full (or close to full) coverage has not yet been achieved, and there are still barriers to overcome to resolve this. In addition to coverage, identification of the very small proportion of long-term migrants among the total number of travellers will be extremely difficult.
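
The three conditions for classifying a long-term immigrant from a travel history can be sketched in Python as follows. This is only an illustration under strong assumptions: the `(date, direction)` event encoding and the function name are hypothetical, "12 months" is approximated as 365 days, and the events are assumed to be already de-duplicated and correctly linked to one person.

```python
from datetime import date, timedelta

# Assumption: "12 months" is approximated as 365 days for simplicity.
YEAR = timedelta(days=365)


def is_long_term_immigrant(events, entry_date):
    """Apply the three conditions to one person's linked travel events.

    events: list of (date, direction) tuples, direction being "in" or "out"
    (a hypothetical encoding, not the Semaphore schema).
    """
    # Condition 1: the person is recorded entering the UK on entry_date.
    if (entry_date, "in") not in events:
        return False

    # Condition 2: no record of being in the UK in the previous 12 months.
    # Any event inside the window shows presence; so does an earlier entry
    # that was never followed by a recorded exit.
    earlier = sorted(e for e in events if e[0] < entry_date)
    seen_in_window = any(d >= entry_date - YEAR for d, _ in earlier)
    never_left = bool(earlier) and earlier[-1][1] == "in"
    if seen_in_window or never_left:
        return False

    # Condition 3: no recorded exit in the 12 months after entry.
    left_within_year = any(
        direction == "out" and entry_date < d <= entry_date + YEAR
        for d, direction in events
    )
    return not left_within_year
```

A person with a single recorded entry and no other events would be classified as a long-term immigrant by this sketch, which shows why a missed exit on an uncovered route leads to misclassification. The sketch also takes complete coverage and perfect linkage for granted, and in real data that is precisely what makes the identification of long-term migrants difficult.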
Identifying these migrants requires either:
- linking an individual’s travel events over time to a level of accuracy such that the derived migration estimates are more precise than those derived from the IPS (there are technical and definitional barriers that prevent this), or
- asking all passengers detailed questions about their travel intentions like those found on the IPS (it would not be feasible to collect this using the Semaphore system due to legal and operational constraints)

However, the UKSA response also confirmed that ‘ONS is continuing to work with the Home Office to identify what statistical benefits [Semaphore data] can provide, building on an initial assessment published in March 2012’. To that end, ONS received an extract of Semaphore data from Border Force (BF) in 2013 covering the period from April 2009 to November 2012, and have since received regular updates on how the system has developed since November 2012. This report summarises our work with the data extract and these updates.

[1] http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/523/523.pdf
[2] http://www.statisticsauthority.gov.uk/reports---correspondence/correspondence/letter-from-sir-andrew-dilnot-to-bernard-jenkin-mpuksa-response-hc-523-06122013.pdf
[3] http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/1228/122804.html

1.3 What’s covered in this report?

Significant work was undertaken to process the 2009-2012 extract given the sheer size of the dataset (850 million records), and to complete important security procedures necessary to safeguard these personal data. Data became available to ONS researchers in early 2014. Since then, ONS has assessed the quality of the data in the context of the theoretical benefits set out in our last report.
Early analysis showed a rapidly changing picture throughout the period when these data were collected, and BF has informed ONS that improvements have continued since the end of 2012 when the last data we have were collected. We have worked with BF to understand these changes and have included information we consider relevant to the potential statistical uses of the data to provide an up to date and complete picture.

This report describes:
- four key aspects of data quality as we found them in the 2009-2012 extract (‘where we were’), and how this may have changed since the end of 2012 (‘where we are now’, using up-to-date information from BF)
- how ONS have been able to use the 2009-2012 extract in practice already
- how else Semaphore data could be used to improve migration statistics, likely in combination with other sources, and what data quality improvements may be required to achieve this
- next steps

1.4 Considerations

When reading this report, please note:

Semaphore is a live data system which is constantly being updated, improved and rolled out

Therefore ONS’ data extract may not reflect the current system accurately and issues identified may have already been addressed and improved by BF. To this end, ONS used the data extract for data familiarisation and basic data quality checks, but have maintained communications with BF to ensure our understandings and recommendations are up-to-date. The ‘Where we are now’ sections of the report address these discrepancies and use current BF information to build a more up-to-date picture of Semaphore as it is now.

The distinction between statistical quality and operational quality of the data

BF’s operational requirements may be very different to ONS’ statistical requirements and features of the data that present statistical challenges may actually benefit operations.
For example, multiple entries for the same person making the same journey may be helpful for the purposes of identifying individuals of interest for security reasons. However, they present a statistical challenge because accurate statistics will depend on removing duplicate records.

Data security challenges

ONS have rigorous processes and procedures in place to ensure that data are protected by appropriate privacy and security safeguards. In order to comply with principles set out in the Data Protection Act 1998, ONS only requested variables identified as necessary for the research (Annex A).

The Semaphore data are held within a Statistical Research Environment (SRE). The SRE is ONS’ administrative data environment which has been specifically designed to address privacy and security concerns that could arise in terms of research with personal data. As a result researchers do not have access to the identifiers associated with personal data. In the case of Semaphore data, names, dates of birth and travel document number are pseudonymised (see Annex A). This pseudonymisation presents some research challenges and indicates the need for further collaborative working with BF on processing the data.

2. Data quality

An ideal system to measure migration accurately would include:
- one unique record for each person on a journey (no duplication)
- total coverage of all routes into and out of the UK
- accurate, complete and consistent information within each record to enable the same person to be linked across the dataset
- a way of collecting intentions data

Semaphore was never intended or designed to address these requirements, and it cannot collect intentions information, but it has the potential to meet some of them to some degree. The Office for National Statistics (ONS) has assessed the 2009-2012 Semaphore extract against these aspects of data quality.
2.1 Duplication

The Semaphore system data is defined as a ‘transactional’ administrative data source: each travel event is a ‘transaction’ and appears as a new entry on the system. The information is passed into the Semaphore system at least twice during the travel process: 1) at check-in (this could be an individual checking in online), and 2) at departure (either via electronic swiping of the document or, depending on the location around the globe, typed in manually). This design leads to multiple records for the same travel event and this duplication is an important issue for ONS to resolve.

For Border Force’s (BF) operational purposes, repeated downloads of data are a benefit, as they provide more opportunities to identify people of interest. For statistical purposes, duplicate entries need to be removed because including duplicated travel events could inflate any figures and complicate any linkage methods. This is an example of a clear distinction between assessing the data in terms of statistical versus operational quality. The Semaphore system has been designed to meet its operational purpose of improving UK Border security and this will not always include the features required to produce high quality statistical outputs.

Determining what was a unique record and what was an additional copy or duplicate in the 2009-2012 extract was a significant challenge because there was no unique identifier to link multiple records for the same travel event together.
A ‘de-duplication’ strategy was developed as follows:

Two types of repeated instances of a single travel event were defined:
- exact replicates: identical copy of a travel event, where all variables were the same as another record
- duplicates: non-identical copy of a travel event, where a number of defining variables were the same as another record

For duplicates the defining variables were a combination of:
- journey information: travel date and flight code or time for ferries
- document information: document number and nationality
- person information: forename, surname and date of birth

There were two stages to choosing these defining variables. Firstly the variables with the lowest level of missing and invalid data were identified. Then a logical exercise was conducted to assess what combination of variables is required to identify a unique instance of a document being used to undertake a journey by an individual.

This is a relatively conservative strategy to ensure no genuine records are removed in error, but it means it is likely that some duplicates were not removed. For example, if one of the defining variables such as surname was incorrectly captured at check-in, then later corrected via an additional record, the strategy employed will retain both records because surname does not match.

2.1.1 Where we were

The ‘de-duplication’ strategy removed both exact replicates and duplicates, and resulted in the removal of approximately 40% of records on average across the dataset. In terms of quantity this reduced the 850 million records provided to ONS to 509 million records.

2.1.2 Where we are now

Event Type [4] is a field that only began to be consistently completed from October 2011, some way through the extract period. This variable will be important in better understanding and resolving duplication. It is accepted and understood that the Semaphore system operates for a specific purpose and was not initially intended to provide statistical information.
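
The de-duplication strategy set out in section 2.1 can be illustrated with a minimal Python sketch. It is not ONS’ actual implementation: the field names are hypothetical stand-ins for the defining variables listed in the strategy, and records are represented as plain dictionaries. Because an exact replicate matches on every variable, it also matches on the defining variables, so a single pass handles both types of repeat.

```python
# Hypothetical field names standing in for the defining variables
# (journey, document and person information) described in the text.
DEFINING_VARIABLES = (
    "travel_date", "service_code",       # journey information
    "document_number", "nationality",    # document information
    "forename", "surname", "date_of_birth",  # person information
)


def deduplicate(records):
    """Keep the first record seen for each combination of defining variables."""
    seen = set()
    kept = []
    for record in records:
        key = tuple(record.get(name) for name in DEFINING_VARIABLES)
        if key in seen:
            continue  # exact replicate or duplicate of an earlier record
        seen.add(key)
        kept.append(record)
    return kept


base = {
    "travel_date": "2012-05-01", "service_code": "XX123",
    "document_number": "P001", "nationality": "GBR",
    "forename": "JANE", "surname": "SMITH", "date_of_birth": "1980-01-01",
}
records = [
    dict(base, event="check-in"),
    dict(base, event="departure"),                    # duplicate: same key
    dict(base, surname="SMYTH", event="check-in"),    # mis-keyed surname
]
unique = deduplicate(records)
```

In this made-up example the departure record is dropped, but the record with the mis-keyed surname is retained, which mirrors the conservative behaviour described above. The reported reduction from 850 million to 509 million records corresponds to removing 1 - 509/850, or roughly 40%, of the extract.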
BF have told us that duplicates in API data are an expectation of the Semaphore system as the system allows multiple drops of data. Continued BF and ONS collaboration will prioritise monitoring the level of duplicates in future extracts provided by BF and working together in order to inform future ONS de-duplication processes, for example exploring BF de-duplication strategies and whether ONS can use these.

[4] Event type or Event code is a variable within Semaphore which distinguishes at which stage of the process the information is entered, for example check-in or departure.

2.2 Coverage

Coverage refers to how many entries into and exits from the UK are captured by the Semaphore system. The statistical requirements of a migration counting system would be that all instances of entry into, and exit from, the UK are captured, or where they are not, it must be possible to adjust for those missed. Otherwise some migrations could be missed altogether, and other travel events could be wrongly identified as migrations. For example, if an individual travels to the UK, then leaves via a route or method that is not covered within 12 months, they may be wrongly identified as a long-term international immigrant.

2.2.1 Where we were

Based on the de-duplicated 2009-2012 extract of Semaphore data, we independently estimate that coverage increased over time, reaching 69% in 2012, when compared with Civil Aviation Authority (CAA) data. This is an estimated figure and there are reasons why it should be treated with caution. Our de-duplication strategy means that there are still likely to be duplicates of the same travel event in cases where one or more of the defining variables change between check-in and departure confirmation, potentially inflating the coverage estimate.
Civil Aviation Authority (CAA) data only includes air travel, but this is comparable to ONS’ extract as the 2009-2012 extract only included a very small number of non-air traffic records [5]. Future ONS extracts will include ferry and rail traffic and therefore other comparators may be needed, for example the International Passenger Survey (IPS).

BF estimates that Advance Passenger Information (API) coverage increased from 57% in 2009 to 67% in 2012. In other words BF’s estimate of coverage in 2012 is broadly similar to the coverage independently estimated by ONS. Even at 69%, higher coverage is required for most statistical purposes, and the 2009-2012 data cannot be used to produce estimates of international migration.

[5] There is a very small proportion of ferry traffic and no rail traffic in this extract.

2.2.2 Where we are now

BF state that API coverage (the estimated proportion of passengers who travel to and from the UK on routes connected to Semaphore) is approximately 80% in 2014. This is a significant improvement on the position at the end of 2009, where API coverage was just under 60%. Commercial air, maritime and rail coverage now stands at approximately 95%, 20% and 0% respectively. By the end of March 2015, it is envisaged that overall coverage will increase to just below 85%, with maritime rising to approximately 50%.

Levels of coverage will benefit further from the Exit Checks programme that is due to go live in April 2015. Exit Checks will build on existing arrangements where possible, including using API provided by carriers, supplemented with checks conducted at the border on departure where necessary.

BF’s estimates of coverage show improvement over time and there is optimism that the coverage will continue to increase, but without near complete coverage ONS will not be able to produce direct counts of migration.
However, even with non-complete coverage Semaphore continues to be of interest to ONS and other uses are being explored. These are discussed in more detail in section 3. As coverage continues to improve, the statistical usefulness of the data will increase and ONS will work with BF to keep up-to-date with developments.

2.3 Completeness

In addition to coverage, within-record variable completeness is vital to understanding the statistical quality of Semaphore data. If all travel events are captured, but individual data items important for statistical production are missing or invalid, data utility is reduced. There are operational reasons why all fields are not recorded for every record but from an ONS perspective, where data are missing or invalid, the ability to accurately link records for the same individual over time may decrease. A number of demographic variables in addition to travel document number may be important for data linkage, for example name, date of birth and sex.

2.3.1 Where we were

Analysis of the extract supplied to ONS showed that approximately 96% of records have complete and valid demographic information, including date of birth, name, sex and nationality. It could be that some of the remaining 4% are duplicates that would be removed if a stricter de-duplication strategy were possible without losing any genuine records.

Within the ONS extract there was no consistent unique variable to link multiple records (for example check-in and departure records) for the same travel event. However, initial analysis strongly suggested that where there was a record with missing data at check-in, the departure confirmation record contained the demographic information and vice versa. Hence, it is believed that the development of a stricter de-duplication strategy will also need to consider retaining the demographic data from whichever of the two records is more complete.
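
The idea of retaining the most complete demographic data across check-in and departure records can be sketched as a field-by-field merge. This is a hedged illustration only: the field names and function are hypothetical, and it assumes the records for one travel event have already been grouped together, which the extract did not allow without a consistent linking variable.

```python
# Hypothetical names for the demographic variables mentioned in the text.
DEMOGRAPHIC_FIELDS = ("forename", "surname", "date_of_birth", "sex", "nationality")


def merge_travel_event(records):
    """Combine the records for ONE travel event into a single record.

    For each field, the first non-missing value is taken, so gaps in the
    check-in record can be filled from the departure record and vice versa.
    """
    merged = {}
    for field in DEMOGRAPHIC_FIELDS:
        merged[field] = next(
            (r[field] for r in records if r.get(field) not in (None, "")),
            None,  # the field is missing from every record
        )
    return merged


check_in = {"forename": "JANE", "surname": "SMITH", "date_of_birth": "1980-01-01",
            "sex": "", "nationality": "GBR"}
departure = {"forename": "JANE", "surname": None, "date_of_birth": "1980-01-01",
             "sex": "F", "nationality": "GBR"}
single_record = merge_travel_event([check_in, departure])
```

Taking the first non-missing value is a design choice made here for simplicity. It does not resolve conflicts where both records hold different non-missing values, which a production approach would need to handle.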
Only small proportions of invalid data were found (less than 2% of records contained some invalid data). More detailed investigation showed that in the majority of cases the invalid data related to a different variable, for example a date of birth appearing in a name field. This may be caused by either input error or as a result of the technical data extraction and transfer process.

2.3.2 Where we are now

BF has provided explanations for the incompleteness of some of the variables within the 2009-2012 extract. It is possible for carriers to submit information for passengers several times and from an operational point of view, as long as all the information has been submitted at least once, this is acceptable.

With the ‘event type’ variable now consistently available on the Semaphore system, it will be easier to link between multiple records to create a single record per travel event. Event type was introduced in 2011. It is therefore available on only part of the 2009-2012 extract. BF estimate that record linking using event type with 2014 data results in less missing demographic data. If future ONS extracts consistently contain the event type variable, we expect that creating a single record that includes the best combination of completed fields, where there is more than one record per travel event, will be easier. In addition, BF have implemented mechanisms to improve data completeness in the areas identified by ONS.

2.4 Data consistency

Data consistency refers to the extent to which an individual’s travel details contain the same information every time their travel information is entered onto Semaphore.
There are some cases where details may change over time, reflecting individual circumstances, for example:
- an individual may use different documents (for example a passport on the way out of the UK and an EU Identity Card on the way back into the UK, or travelling on different passports if the person has dual nationality)
- renewing an out of date passport (in the UK, every 10 years for adults, 5 years for children)
- changing name (perhaps through change in marital status)
- replacing a lost or stolen document

Changes in document numbers over time mean it is likely that demographic variables will also be required in any data linkage method. However, if there are changes to demographic variables as well (for example a marriage leading to both a change in name and passport number), it would be more difficult to link the data well enough to create travel histories.

2.4.1 Where we were

It is normal and, in fact, expected to find some inconsistencies and unexpected characters in administrative data, especially in free text fields such as the name variables. The name variables on Semaphore are considered important for data linkage so much of our initial work on consistency focussed on name variables.

Focus on name variables

To allow researchers to view real data from the name variables (that had not been pseudonymised but still met data security controls), a list of unique names present in the data was produced. All that could be deduced was that each name had appeared at least once. This process was repeated separately for each of the three separate name variables (forename, surname, and middle names) so there was no way to link the three individual lists together. In addition, the examples used in this section are not real examples, but accurately reflect issues that were found in the data.

There were four main issues ONS found with the names variables within the extract that may affect the ability to link records for the same individual over time:

1. Number of words in name fields: Whilst it was expected that some name fields would contain more than one word, it was expected that the majority of names would consist of a single word. However, only in surname was the modal value one word as expected; in middle name and forename the modal value was two words. Clerical review of the data showed instances where it is likely a person’s full name appeared in one field rather than spread across all three name variables, although in some cases it appeared to be caused by a number of initials.

2. Unexpected characters: The International Civil Aviation Organization (ICAO) document 9303, Machine Readable Travel Documents, states that the only symbol, other than numbers or letters from the Roman alphabet, that should be on the machine readable zone of a passport is ‘<’. This is used as a separator and filler character. However, some of the data in Semaphore is manually entered and several examples of other symbols were found; for example ‘ARMS*RONG’ where an asterisk replaced a letter, or ‘5TEVE’ where a letter has been entered or read as a number.

3. Generational suffixes: A number of entries were found with what appear to be generational suffixes, for instance JOHN DOE II. The existence of generational suffixes is not unexpected, however if they are inconsistently applied over time it presents a challenge for data linkage of an individual’s travel events.

4. Titles: Titles and qualifications should not be on the travel document Machine Readable Zone, however a number of titles were found in the names fields, including Mrs, Mr, Miss, Ms, Dr and PhD. As with generational suffixes, if titles are inconsistently used (sometimes present, sometimes not), it presents challenges should the names variables be required to achieve accurate data linkage.

2.4.2 Where we are now

Operationally, all the information that BF require from the Semaphore system is captured.
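
The name issues described in section 2.4.1 suggest the kind of post-processing that could be applied before linkage. The Python sketch below is illustrative only and does not describe BF or ONS processing: the title and suffix lists, the function names and the cleaning rules are all assumptions chosen to match the examples given in the text.

```python
import re

# Assumed lists, based only on the examples given in the text.
TITLES = {"MR", "MRS", "MISS", "MS", "DR", "PHD"}
GENERATIONAL_SUFFIXES = {"JR", "SR", "II", "III", "IV"}


def normalise_name(raw):
    """Upper-case a name field and drop titles and generational suffixes."""
    # Treat the MRZ filler character '<' as a word separator.
    text = raw.upper().replace("<", " ")
    tokens = [t.strip(".,") for t in re.split(r"\s+", text.strip())]
    kept = [
        t for t in tokens
        if t and t not in TITLES and t not in GENERATIONAL_SUFFIXES
    ]
    return " ".join(kept)


def has_unexpected_characters(name):
    """Flag anything other than upper-case Roman letters and spaces."""
    return re.search(r"[^A-Z ]", name) is not None


def word_count(name):
    """Number of words in a name field, to check against the expected one."""
    return len(name.split())
```

For example, `normalise_name("Dr John Doe II")` gives `"JOHN DOE"`, and `has_unexpected_characters("ARMS*RONG")` is true. Removing tokens outright carries its own risk, since a genuine name that happens to equal a title or suffix would also be stripped, so rules like these would need testing against real data.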
In terms of names, BF have sophisticated data tools to clean, categorise and link. However, high levels of consistency in the name fields are crucial if they are to be used in ONS data linkage processes. There are three main ways in which the quality of the names variables could be optimised and these are all being progressed:
- feeding findings back to BF to help improve the data; BF are working towards improving the demographic data at source via the introduction of Key Performance Indicators (KPIs) for operators
- developing better ways to process the name variables after they have been received from BF to maximise consistency over time, for example by considering whether all generational suffixes could be removed in case they are inconsistently applied
- working with BF to understand the processes, rules and systems they have in place for processing name variables to determine if these might meet statistical needs too

2.5 Conclusion on Data Quality

ONS previously suggested that, in theory, Semaphore could be used to measure migration directly but the UK Statistics Authority [6] and the government [7] have provided reasons why this is not currently possible. Our analysis of the Semaphore 2009-2012 extract, and information from BF about improvements since 2012, support this view. However, the known statistical data quality issues of duplication, coverage, within-record variable completeness and consistency do not prevent Semaphore having some immediate utility and ONS have been keen to exploit this. The next section looks at what ONS has been able to achieve so far.

[6] http://www.statisticsauthority.gov.uk/reports---correspondence/correspondence/letter-from-sir-andrew-dilnot-to-bernard-jenkin-mpuksa-response-hc-523-06122013.pdf
[7] http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/1228/122804.html

3. What has ONS been able to use Semaphore for so far?
High-level aggregate counts of passenger flows from the 2009-2012 Semaphore data extract are being used by an ongoing International Passenger Survey (IPS) Sample Review. Semaphore data are particularly relevant to Port Coverage, Sample Optimisation and Out of Hours Air Traffic Coverage. In this context, while duplication may exist, if these duplicates are evenly distributed, they will not affect overall aggregate patterns or proportions. In addition, coverage and completeness at certain ports of interest may be sufficient for the needs of the IPS Sample Review. Since there is no requirement to link passengers over time for this work, inconsistency in how the same individual’s information is captured between different journeys does not necessarily affect the utility of the flow data.

3.1 Background on IPS Sample review

The IPS collects information relating to travel and tourism between the UK and other countries, and screens a larger sample of passengers entering and exiting the UK to identify migrants. A sample of approximately 250,000 passengers is achieved per year for the travel and tourism sample and approximately 800,000 people are interviewed as part of the migrant sift (around 4,000-5,000 of these are migrants), using a sampling frame that covers 99% of the ports/routes into and out of the UK. Due to out of hours travel (passengers travelling at times when the IPS is not in operation) and travel through ports not included in the sampling frame, the IPS sample is representative of approximately 95% of all passenger traffic to and from the UK, with the “missing” 5% accounted for within the weighting process.

A full-scale review of the IPS sample is being carried out to ensure the IPS sample continues to be representative of UK passenger traffic and Semaphore offers a potentially valuable additional source to feed into the review.
3.2 Review of Port Coverage

Semaphore data including aggregate flows of passengers by port and nationality are being used to help assess whether any of the ports currently excluded from the IPS sample are potentially important migration routes, or tend to attract overseas tourists from particular regions likely to be of interest to IPS users such as VisitBritain.

3.3 Sample Optimisation

A sample optimisation for the IPS is designed to assess whether the overall sample profile needs to be updated in order to ensure the quality of the migration and tourism estimates is maintained. Semaphore data including flows of passengers by port, nationality and shift (morning, afternoon or night) are being used in conjunction with other sources to identify the most effective combination of shifts at all ports. This would provide robust data against the IPS’s three core information areas: migration; earnings and expenditure information; and travel and tourism details. Extra information provided by Semaphore may help ensure the sample design meets stakeholders’ data quality requirements and also that the optimal sample design is cost effective and takes interviewer resource and operational constraints into consideration.

3.4 Review of Out of Hours Air Traffic Coverage

Currently the IPS covers approximately sixteen hours of passenger traffic over an ‘am’ and ‘pm’ shift, typically 6:00am to 10:00pm at airports. Passengers arriving or departing on flights outside these hours are not sampled, which is accounted for in the weighting system. Because a number of flights are not covered, there are concerns that the IPS may not account for different types of flights and passengers, and there may be considerable variations between airports/terminals and by direction and quarter of the year.
Semaphore data including passenger flows by port, nationality and by arrival and departure time by hourly slots will help inform or challenge the current assumption that passengers on certain routes travelling in-hours are similar to those travelling on the same routes out-of-hours.

3.5 The IPS Sample Review and Optimisation 2014

This section has briefly outlined some of the ways in which Semaphore may help with the IPS Sample Review. The outcomes of the wider IPS Sample Review project will be reported on separately through the IPS Sample Review and Optimisation Project Board. The timetable of the Sample Review is currently under review, but it is envisaged that the Sample Optimisation will be completed for implementation in January 2016.

4. How can Semaphore improve migration statistics and what improvements in data quality are required to achieve this?

The Office for National Statistics (ONS) agrees with the UK Statistics Authority and government position that Semaphore cannot be used to directly measure migration in its current form. However, our work with the 2009-2012 Semaphore data extract, and collaboration with Border Force (BF), have helped develop our understanding of what may be possible as the statistical quality of the Semaphore data improves. The potential uses below are presented in order of how soon it is likely any benefit could be realised.

4.1 Improving the International Passenger Survey

Section 3 explained how ONS is already using Semaphore data as part of ongoing International Passenger Survey (IPS) improvement work. This will continue, using both the data we have, and more up-to-date Semaphore extracts as we complete work with BF to acquire these (see section 5). The main data quality requirement in this area is that coverage continues to increase. In the 2009-2012 extract, there is a lot of variation in coverage by port, which limits how useful the data can be.
Current levels of duplication and within record completeness may not be a barrier in this instance. Data consistency is also less important in this context because accurate linking of an individual’s travel over time is not required. It is worth reiterating that Semaphore cannot completely replace the IPS, which collects different pieces of information that are required by users of migration statistics. The IPS includes information on reason for stay and also intended length of stay. While intended length of stay may not be realised in subsequent actual behaviour, affecting accuracy, it is timelier because we don’t need to wait and observe the behaviour. Using Semaphore data in conjunction with the IPS is the most likely short term way in which Semaphore data could be used to improve migration statistics.

4.2 Combining the data with other administrative sources

Across all topics, ONS is looking to increase its use of administrative data (that is, data already collected, often by government, for other reasons) in statistical production. As ONS has shown for Semaphore data, there are often limitations in the statistical utility of an administrative dataset precisely because the data are not collected for statistical purposes. However, by combining multiple administrative sources that have different strengths and weaknesses, these limitations can often be mitigated or overcome.

4.2.1 Using administrative data to measure the population, particularly at the local level

From a population statistics perspective, the ONS Beyond 2011 programme has been assessing the different possible approaches to producing population and housing statistics in future.
Administrative sources currently being researched by the programme include those held by the Department of Health, the Department of Work and Pensions, HM Revenue and Customs, the Department for Education and the Higher Education Statistics Agency.8 As a result of the programme’s research, in March 2014, the National Statistician recommended a predominantly online census in 2021 supplemented by the further use of administrative and survey data. This approach has been approved by government. As research using administrative data to measure the population continues, how immigrants first appear on administrative sources, and if and how emigrants are removed from those sources, will be important. For example, in terms of immigration, when a patient registers with a new GP for the first time, they are asked whether they are from abroad and when they first arrived in the UK. Capturing emigration is likely to be more challenging but inactivity on employment related databases may be one possibility. Semaphore may have a role to play in quality assuring findings from other administrative sources, for example checking whether an individual identified as a potential immigrant or emigrant on another source, has also been recently recorded entering or leaving the UK by Semaphore. For Semaphore data to be used in this way, coverage will be important but not necessarily crucial because the other sources may cover individuals missing from Semaphore. Duplication for an individual journey may also not affect the utility of the data if the initial migrant identification is driven by another source where duplication has been easier to resolve. However, complete and accurate data within a record will be important because accurate linkage of individuals across datasets would be required. 
8 A full list can be found in Annex C of this report: ‘Beyond 2011: Safeguarding Data for Research: Our Policy’, July 2013

4.2.2 Using administrative data to measure international migration, particularly at the national level

The initial focus of the Beyond 2011 programme research has been investigating ways to estimate the population by age and sex for sub-national areas (e.g. local authorities). Confirming whether a record (for a person at an address) on an individual administrative source should contribute to a population estimate for a given area will be an important part of ongoing research. If those migrating internationally can be distinguished from those migrating elsewhere within the UK, it may lead to statistics about flows of people as well as about people in a particular area.

However, setting out to better measure international migration specifically, particularly at a national level, may lead to a different methodology or combination of sources. To explore this fully and deliver results quickly would require a separate programme of work. Semaphore data would be an important source in such research. An important measure of success for any method developed would be whether it can deliver more accurate estimates than the IPS.

It is worth noting that under current immigration policy, finding a combination of administrative sources that achieves this will be more realistic for non-EU migration than EU migration. Visa and other data are generated for non-EU migrants by mandatory interactions with the immigration system and these data may be required to account for limitations in other administrative sources such as Semaphore. Even in the case of non-EU migrants, a high match rate between Semaphore and visa data would be required to confirm entry and exit dates of those with a visa.
Semaphore would need complete or near complete coverage, and a high level of within record completeness and accuracy to maximise the chances of success.

4.3 Estimating the population present in the UK

If all flows in and out of the UK were accurately recorded by key characteristics such as nationality, age and gender, this information would provide insight into the population present in the UK. This would provide an indirect picture of migration even if individual travel histories were not linked over time. A count of arrivals and departures, and therefore net in/outflow by nationality, will include a large number of visitors, and a much smaller number of short and long-term migrants. However, visitors will leave after a short stay. Therefore, rolling counts starting from any given reference point could potentially provide information on any long-term migration trends via discrepancies between in and out flows.

Such an approach would compare two enormous flows of close to 100 million movements in and out of the UK per year, and even a very small degree of inaccuracy in such large counts could be magnified in the much smaller net difference between them. For example, a bias in one flow of just 0.1% would add 100,000 to the annual difference between flows.

Although less ambitious than travel histories, statistics of this nature would require complete or near complete coverage, very low levels of duplication, and complete and accurate data for the characteristics of interest (for example, nationality). This is because on average, approximately 1 in every 200 passenger movements is a migration event, so any differences between in and out flows caused by limitations or statistical bias in the data (for example, a port that is more common for departures than arrivals not being covered) could mask any real world differences.
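The sensitivity described in section 4.3 can be illustrated with a short calculation. This is only a sketch using the round figures quoted in the text (a flow of about 100 million movements per year, a 0.1% bias in one flow, and 1 in 200 movements being a migration event); the function and variable names are illustrative assumptions and do not represent an ONS method.

```python
def net_flow_error(flow_size: int, relative_bias: float) -> float:
    """Absolute error added to (inflow - outflow) when one flow is biased."""
    return flow_size * relative_bias


def expected_migration_events(movements: int, one_in: int = 200) -> float:
    """Approximate number of movements that are migration events."""
    return movements / one_in


# Assumed round figures taken from the text, not measured values.
ANNUAL_FLOW = 100_000_000   # movements in one direction per year
BIAS = 0.001                # 0.1% bias in one of the two flows

error = net_flow_error(ANNUAL_FLOW, BIAS)
events = expected_migration_events(ANNUAL_FLOW)

print(f"Error in net flow from bias: {error:,.0f}")
print(f"Migration events in the flow: {events:,.0f}")
print(f"Error as a share of migration events: {error / events:.0%}")
```

Under these assumptions, a bias that is negligible relative to total traffic is equivalent to a substantial share of the migration events the counts are meant to reveal, which is why near complete coverage and very low duplication would be needed.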
4.4 Travel histories

If travel into and out of the UK were accurately recorded once per person per journey, it would in theory be possible to link people’s travel patterns over time. For a proportion of individuals, this will already be possible. However, there are many barriers to translating these travel histories into an estimate of migration, especially one we could be confident was more accurate than the IPS.

Estimation of international migration would depend on identification of individuals where there is a record of entry but not departure (immigration), or a record of departure but not return (emigration), within a specified period (12 months for long-term migration estimates according to the UN definition used by ONS). As described in section 4.3, approximately 1 in 200 passenger movements is a migration event. If even a small proportion of the other 199 in 200 movements were misclassified as a migration event, any migration estimate based on such a method is unlikely to be more precise than the IPS. This misclassification could happen in a number of ways. Taking immigration as an example:

- without complete coverage, an individual may enter via a route that is covered but leave via a route that is not, inflating any estimate of immigration
- if there were two or more records for an individual entering the UK but only one when they left, and the duplicates could not be easily identified (or all paired with the exit record), the additional entry record could be taken as evidence of additional immigration (in error)
- if there was a failure of the longitudinal linking of an individual’s travel data in creating their travel history, such that their entry and exit records were not paired together, their entry record could inflate any immigration estimate. It is noted that in this case there would also be an incorrect but compensating inflation of any emigration estimate caused by their unpaired exit record.
Because individual document numbers can change (for example, when a passport is renewed, when a name could also change, e.g. following marriage), and dual nationals can use different passports on entry and exit, high levels of completeness, accuracy and consistency over time of demographic information would be required.

There would also be an issue over timeliness to overcome because the above process relies on assessing an individual’s actual behaviour during the 12 months after they enter / leave the UK (for long-term migration estimates), not their intentions. The estimate for any given reference period would require some form of provisional estimation to prevent a lag of over 12 months between the end of the reference period and release of the estimates. These provisional estimates would then be revised as more data about actual behaviour since the reference period became available. At present, ONS publish migration statistics with a 5-month lag between the reference period and the publication date.

4.5 Potential uses not directly related to migration

Short term: In the IPS travel and tourism sample, there is a higher number of UK residents and overseas residents starting their visit than ending their visit. This is important because the IPS overseas travel and tourism estimates are only based on the number of completed visits (i.e. they are calculated from data collected only from interviews with UK residents as they arrive back in the UK and with overseas residents as they depart the UK at the end of their visits). Imbalance may result in visits and spending being under-estimated. Semaphore data can provide passenger flows by arrival and departure airport and nationality so may help inform whether the current weighting applied to correct this imbalance needs adjustment.

Long term: Reliable total flows by important characteristics such as nationality and port of entry (as per section 4.3) may allow new statistics about travel and tourism.
In this case, because separating out the small number of migrants from everyone else is less important, the statistics would be more robust to any error caused by unresolved duplication or incomplete records. However, improving coverage would still be important because different nationalities will visit the UK via different methods / ports, so any gaps in coverage will lead to bias in the data.

5. Next Steps

More research is planned by the Office for National Statistics (ONS) to ensure potential benefits are explored and realised as soon as the Semaphore data are of sufficient statistical quality. Specifically:

5.1 Future Semaphore data extracts

ONS have begun the process to request more frequent, smaller extracts of Semaphore data. We will use these extracts in combination with regular advice from Border Force (BF) colleagues to monitor data quality and determine when to research any given potential improvement further.

5.2 Regular communications with BF officials

ONS have benefitted from working closely with BF to ensure we have a firm understanding of the data system and processes. BF officials better understand how changes to collection and processing may affect, and hopefully improve, the statistical utility of the data. ONS and BF will establish a regular working group to formalise this information sharing process, ensuring statistical benefits are considered alongside operational benefits as the Semaphore system develops.

5.3 Refining ONS’ de-duplication strategy

Work is planned to create a more robust de-duplication strategy guided by BF advice. This may also improve other indicators of statistical quality. For example, where two records are incomplete, it may be because they relate to the same person on the same journey and are missing different pieces of information in each case. Once this duplication is resolved, and the records combined, instances of incomplete fields will decrease.
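The effect described in section 5.3 can be sketched as follows. This is a minimal illustration only: the field names and values are hypothetical, the decision that two records are duplicates is assumed to have been made elsewhere, and the merge rule shown is an assumption rather than the strategy ONS or BF will adopt.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def merge_duplicates(first: Record, second: Record) -> Record:
    """Combine two records already judged to be the same person and journey.

    For each field, keep whichever value is present. Conflicting non-missing
    values raise an error, since resolving them needs a separate rule.
    """
    merged: Record = {}
    for field in sorted(first.keys() | second.keys()):
        a, b = first.get(field), second.get(field)
        if a is not None and b is not None and a != b:
            raise ValueError(f"conflicting values for {field!r}: {a!r} vs {b!r}")
        merged[field] = a if a is not None else b
    return merged


def count_missing(records: list[Record]) -> int:
    """Count fields with no value across a list of records."""
    return sum(value is None for record in records for value in record.values())


# Hypothetical duplicate records, each missing a different field.
record_a: Record = {"surname": "SMITH", "date_of_birth": "1980-01-31",
                    "nationality": None, "flight_code": "XX123"}
record_b: Record = {"surname": "SMITH", "date_of_birth": None,
                    "nationality": "GBR", "flight_code": "XX123"}

combined = merge_duplicates(record_a, record_b)
print(count_missing([record_a, record_b]))  # missing fields before merging
print(count_missing([combined]))            # missing fields after merging
```

In this example the two incomplete records combine into one complete record, so resolving the duplication reduces both the record count and the number of incomplete fields.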
5.4 Regular user updates

ONS will publish regular updates and users will continue to be updated via events such as the annual Migration Statistics User Forum (MSUF) conference. At the most recent MSUF conference (September 2014), it was evident that users are still interested in the outcome of this work.

The planned work above will address the following research questions:

1. How has data quality improved as new data extracts are received and analysed?
2. How useful has the Semaphore data been in the ongoing IPS improvement work? What have been the limitations? Can this information be used to be more specific about the quality improvements required to allow Semaphore to be permanently integrated into an IPS optimisation process?
3. Increased coverage is fundamental in facilitating all the potential benefits and this can only be increased by BF. In the meantime, can BF and ONS develop techniques that maximise the other aspects of statistical quality?

ONS remains committed to improving population and migration statistics and Semaphore is just one area being investigated. The ONS Population Statistics Research Unit (PSRU) continues to explore several other improvements and regular updates on these are published on the ONS website.

Any questions on this report should be directed to the PSRU inbox.
Annex A – Semaphore variables received

Summary of Semaphore variables received, whether they are pseudonymised by ONS, and the reasons they were included in the extract

Variable name | Pseudonymised by ONS | Reason included in extract (Facilitate entry and exit linkage / Person centric travel histories / Key output variable) | Other

Person related variables
Forename | Yes | X X |
Middle names | Yes | X X |
Surname | Yes | X X |
Date of birth | Yes | X X |
Sex; Nationality | | X X X X |

Document related variables
Document number | Yes | X X |
Document type | | X |
Country of issue | | X X |
Issue date | | X |
Expiry date | | X |

Flight related variables
Carrier | | | Validate route information, investigate quality by carrier
Flight code | | | Validate route information
Arrival date and time | | X X | Date sequencing
Departure date and time | | X X | Date sequencing
Arrival airport | | X |
Departure airport | | X |
Embark airport | | X |
Disembark airport | | X |
OPI (booking) code | | X X | Linking individuals travelling together
Passenger type | | | Remove flight crew
Event type | | | Ascertain data download point

Glossary of terms

Advance Passenger Information (API)
Advance Passenger Information (API) is information from an individual’s passport required by the government of the country being travelled to.

Border Force (BF)
Border Force is a law enforcement command within the Home Office. It secures the UK border by carrying out immigration and customs controls for people and goods entering the UK. Border Force was formed on 1 March 2012 as a law enforcement command within the Home Office. Border Force secures the border and promotes national prosperity by facilitating the legitimate movement of individuals and goods, whilst preventing those that would cause harm from entering the UK.
This is achieved through the immigration and customs checks carried out by our staff at ports and airports.

International Passenger Survey (IPS)
The International Passenger Survey (IPS) collects information about passengers entering and leaving the UK, and has been running continuously since 1961. The IPS conducts between 700,000 and 800,000 interviews a year of which over 250,000 are used to produce estimates of Overseas Travel and Tourism. Interviews are carried out at all major airports and sea routes, at Eurostar terminals and on Eurotunnel shuttle trains. The survey results are used by various government departments, including the Office for National Statistics (ONS), the Department for Transport, the Home Office, HM Revenue and Customs, VisitBritain and the national and regional Tourist Boards. The results are primarily used to:

- measure the impact of travel expenditure on the UK economy
- estimate the numbers and characteristics of migrants into and out of the UK
- provide information about international tourism and how it has changed over time

International Civil Aviation Organization (ICAO)
The International Civil Aviation Organization (ICAO) is a UN specialized agency, created in 1944 upon the signing of the Convention on International Civil Aviation (Chicago Convention). ICAO works with the Convention’s 191 Signatory States and global industry and aviation organizations to develop international Standards and Recommended Practices (SARPs) which are then used by States when they develop their legally-binding national civil aviation regulations. There are currently over 10,000 SARPs reflected in the 19 Annexes to the Chicago Convention which ICAO oversees, and it is through these SARPs and ICAO’s complementary policy, auditing and capacity-building efforts that today’s global air transport network is able to operate over 100,000 daily flights, safely, efficiently and securely in every region of the world.
Long-term International Migrant
‘A person who moves to a country other than that of his or her usual residence for a period of at least a year (12 months), so that the country of destination effectively becomes his or her new country of usual residence.’ - United Nations (1998)

Public Administration Select Committee (PASC)
PASC examines the quality and standards of administration within the Civil Service and scrutinises the reports of the Parliamentary and Health Service Ombudsman.

Travel event
A single travel event is a departure from, or an arrival at, a UK travel port that is recorded on the Semaphore system.

UK Statistics Authority
The UK Statistics Authority is an independent body operating at arm's length from government as a non-ministerial department, directly accountable to Parliament. It was established on 1 April 2008 by the Statistics and Registration Service Act 2007.

The Authority's statutory objective is to promote and safeguard the production and publication of official statistics that serve the public good. It is also required to promote and safeguard the quality and comprehensiveness of official statistics, and ensure good practice in relation to official statistics.

The UK Statistics Authority has two main functions:

1. oversight of the Office for National Statistics (ONS) - its executive office
2. independent scrutiny (monitoring and assessment) of all official statistics produced in the UK.

Membership of the Authority's Board comprises the Chair of the Authority, seven other non-executive members, and three executive members.