Semaphore Research Update - Office for National Statistics

Date: 21 November 2014
Theme: Population
Semaphore Research Update
Summary
The Border Force (BF), part of the Home Office, is responsible for the Border Systems Portfolio (BSP).
The primary aim of the BSP is to improve UK border security by collecting information on those
travelling into, and out of, the UK. Data are largely collected by the Semaphore system and are
referred to as Semaphore data throughout this report.
Why are ONS interested in Semaphore?
The Office for National Statistics (ONS) have a continuous improvement culture for all aspects of
statistical production from data collection to dissemination. Semaphore data are not currently used
within population and migration statistics, and ONS are interested in whether the data can improve
methods of measuring migration. Although the primary objective of the BSP is border security,
there is a commitment within the BSP to help ONS use the data to measure migration.
Readers of this report should note the distinction between statistical and operational quality. This
consideration is important when using any administrative data for statistical purposes. BF’s
operational requirements may be very different to ONS’ statistical requirements and features of the
data that present statistical challenges for ONS may not affect operations.
Can Semaphore data improve migration statistics?
The UK Statistics Authority and government have previously stated that Semaphore cannot be used
to directly measure international migration in its current form. To confirm this, ONS has assessed the
statistical quality of an extract of Semaphore data (covering 2009 to 2012), and we have worked
with BF to understand how improvements they have made since the end of 2012 may have affected
statistical quality. Based on this evidence, ONS agrees that Semaphore cannot be used to directly
measure migration.
ONS has been able to make use of the 2009-2012 Semaphore data in some areas, for example as an
additional source of reference data for International Passenger Survey (IPS) Sample Review work.
We have also been able to give a fresh assessment of the other potential ways in which Semaphore
data could be used to improve migration statistics if and when statistical data quality improves, for
example in combination with other administrative data sources.
What will ONS do next?
ONS will continue research on Semaphore data to ensure potential benefits are explored as soon as
data are of sufficient statistical quality. This work will include receipt and analysis of more extracts of
Semaphore data from BF. Most importantly, ONS have benefited from working closely with BF,
building a clear understanding of the data systems, and BF now better understand the statistical
requirements of Semaphore data. ONS will continue to strengthen this relationship and make sure
statistical benefits are considered alongside operational benefits as BF improve the Semaphore
system further. Regular updates on this research will be published.
ONS is committed to improving population and migration statistics and Semaphore is just one area
being investigated. The ONS Population and Statistics Research Unit (PSRU) continue to explore
other possibilities and regular updates on these are published on the ONS website.
Acknowledgement
We would like to thank colleagues at Border Force (BF) and Home Office for their assistance in
supplying Semaphore data as well as their ongoing collaboration and engagement with the project.
1. Introduction
1.1 What is Semaphore?
The Border Force (BF), part of the Home Office, is responsible for the Border Systems Portfolio (BSP).
The primary aim of the BSP is to improve UK border security by collecting information on those
travelling into, and out of, the UK. Data are largely collected by the Semaphore system and will be
referred to as Semaphore data in this report. These naming conventions replace the terms ‘eBorders system’ and ‘e-borders data’ previously used.
The aim of the BSP is to improve UK border security by collecting information electronically on those
travelling into, and out of, the UK. Operators that carry international passengers to, or from, the UK
by air, rail and sea are required to provide passenger, crew and service details to BF at certain stages
prior to travel. This information enables BF, the police, and HM Revenue and Customs to deploy
resources to deal effectively and efficiently with illegal immigration, crime and threats to UK
security, whilst minimising the impact on legitimate travellers. Although the primary objective of the
BSP is border security, there is a commitment within the BSP to help the Office for National Statistics
(ONS) use the data to measure migration.
The Semaphore system electronically collects passengers’ and crew members’ travel document
information in advance of travel either into, or out of, the UK (a travel document most often refers
to a passport but could relate to any other admissible Identity Card such as an EU Identity Card). The
information collected is known as Advance Passenger Information (API) and consists of the details
that are contained within the machine readable zone (MRZ) of a travel document. It is mandatory for
carriers to collect API and to supply it to BF when requested. Details collected include:
• name
• date of birth
• nationality
• sex
• travel document type
• country of issue
• document number
• issue date
• expiry date
The API is analysed against a series of watch lists of people of interest to security and border
agencies. The results of that analysis enable the security and border agencies to identify and target
individuals of interest before they arrive in, or exit, the UK or to make subsequent interventions as
necessary.
In addition to API, carriers must also supply service information, which includes details such as:
• flight number
• flight times
• name of carrier
• departure/arrival port
The Semaphore system database is defined as a ‘transactional’ administrative data source: each
travel event is a ‘transaction’ and appears as a new entry on the system. The information is passed
into the Semaphore system at two points (as a minimum) during the travel process:
• at check-in (this could be an individual checking in online)
• at departure, either via electronic swiping or manual input of the document information
1.2 Why are ONS interested in Semaphore data?
ONS have a continuous improvement culture for all aspects of statistical production from data
collection to dissemination. For population and migration statistics, Semaphore data are not
currently used and ONS are interested in whether the data can improve methods of measuring
international migration. This report updates users on progress of this work since our last user
update in June 2014.
A 2012 ONS report ‘Delivering the statistical benefits from e-Borders’ outlined the potential ways in
which Semaphore data could improve population and migration statistics. The report was largely
theoretical because of limitations in the Semaphore data available to ONS at that time. The report
concluded:
1. Semaphore data could potentially be used to:
   • produce direct counts of long term and short term migration
   • make improvements to the International Passenger Survey (IPS)
   • deliver other statistical benefits, for example passenger travel statistics
2. Semaphore data could not in its current form completely replace the IPS because the IPS
collects information that is not collected by the Semaphore system (for example reason
for migration)
3. Further research was required, including assessment of a larger Semaphore data extract
as coverage increased
The 2012 report proposed that if Semaphore data provided near complete coverage and the data
was of high statistical quality, then it may be possible to produce travel histories that link details of
an individual’s passenger movements in and out of the UK over time. By analysing these movements,
it would then be possible to infer whether or not an individual was ‘usually resident’ in the UK and
whether their residence status had changed. For example, an individual person would be classified
as a long-term immigrant if they:
• are recorded entering the UK
• have no record of being in the UK in the previous 12 months; and
• have no record of leaving the UK in the 12 months following their recorded entry
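The three conditions above can be sketched in code. This is an illustrative sketch only, not the method used by ONS or BF: the function name, the event representation, the 365-day approximation of "12 months" and the `observed_until` cut-off are all assumptions introduced here. It also assumes complete coverage and perfect linkage of one person's travel events, which the report goes on to explain is not currently achievable.

```python
from datetime import date, timedelta

# Assumption: "12 months" is approximated as 365 days.
YEAR = timedelta(days=365)


def classify_entry(events, entry_index, observed_until):
    """Apply the long-term immigrant rule to one recorded entry.

    events: one person's linked travel events, sorted by date, as
            (date, direction) tuples where direction is "in" or "out".
    observed_until: last date covered by the data (hypothetical parameter).
    Returns True, False, or None when the following 12 months are not
    yet fully observed.
    """
    entry_date, direction = events[entry_index]
    if direction != "in":
        return False

    prior = events[:entry_index]
    # Any event in the previous 12 months implies presence in the UK.
    if any(entry_date - YEAR <= d < entry_date for d, _ in prior):
        return False
    # If the last known event was an entry, the person was last seen in the UK.
    if prior and prior[-1][1] == "in":
        return False

    later = events[entry_index + 1:]
    # A recorded exit within 12 months of entry rules out long-term migration.
    if any(dirn == "out" and entry_date < d <= entry_date + YEAR
           for d, dirn in later):
        return False

    # Without a full 12 months of follow-up the rule cannot be evaluated.
    if entry_date + YEAR > observed_until:
        return None
    return True


# Hypothetical travel history for one person.
history = [
    (date(2010, 1, 10), "in"),
    (date(2010, 1, 20), "out"),
    (date(2011, 6, 1), "in"),
]
extract_end = date(2012, 11, 30)
short_visit = classify_entry(history, 0, extract_end)
long_stay = classify_entry(history, 2, extract_end)
```

In this invented history the first entry is followed by an exit ten days later, so it is not classified as long-term immigration, whereas the second entry has no nearby prior events and no recorded exit in the following year. A missed exit record would produce the same result as a genuine long stay, which is why coverage matters so much for this approach.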
Users of migration statistics are interested in this possibility, and in July 2013, a Public
Administration Select Committee (PASC) review¹ into migration statistics included a
recommendation that ‘ONS and Home Office should move as quickly as possible to measuring
immigration, emigration and net migration using [Semaphore]’.
However, this is not straightforward. Both the UK Statistics Authority² and the government³ have
since responded highlighting limitations of the Semaphore data, and that it cannot be used to
measure migration directly. A key limitation is that full (or close to full) coverage has not yet been
achieved, and there are still barriers to overcome to resolve this. In addition to coverage,
identification of the very small proportion of long-term migrants among the total number of
travellers will be extremely difficult. It requires either:
• linking an individual’s travel events over time to a level of accuracy such that the derived migration estimates are more precise than those derived from the IPS (there are technical and definitional barriers that prevent this), or
• asking all passengers detailed questions about their travel intentions like those found on the IPS (it would not be feasible to collect this using the Semaphore system due to legal and operational constraints)
However, the UKSA response also confirmed that ‘ONS is continuing to work with the Home Office to
identify what statistical benefits [Semaphore data] can provide, building on an initial assessment
published in March 2012’. To that end, ONS received an extract of Semaphore data from Border
Force (BF) in 2013 covering the period from April 2009 to November 2012, and have since received
regular updates on how the system has developed since November 2012. This report summarises
our work with the data extract and these updates.
¹ http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/523/523.pdf
² http://www.statisticsauthority.gov.uk/reports---correspondence/correspondence/letter-from-sir-andrew-dilnot-to-bernard-jenkin-mpuksa-response-hc-523-06122013.pdf
³ http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/1228/122804.html
1.3 What’s covered in this report?
Significant work was undertaken to process the 2009-2012 extract given the sheer size of the dataset
(850 million records), and to complete important security procedures necessary to safeguard these
personal data. Data became available to ONS researchers in early 2014. Since then, ONS has
assessed the quality of the data in the context of the theoretical benefits set out in our last report.
Early analysis showed a rapidly changing picture throughout the period when these data were
collected and BF has informed ONS that improvements have continued since the end of 2012 when
the last data we have were collected. We have worked with BF to understand these changes and
have included information we consider relevant to the potential statistical uses of the data to
provide an up to date and complete picture.
This report describes:
• four key aspects of data quality as we found them in the 2009-2012 extract (‘where we were’), and how this may have changed since the end of 2012 (‘where we are now’, using up-to-date information from BF)
• how ONS have been able to use the 2009-2012 extract in practice already
• how else Semaphore data could be used to improve migration statistics, likely in combination with other sources, and what data quality improvements may be required to achieve this
• next steps
1.4 Considerations
When reading this report, please note:
Semaphore is a live data system which is constantly being updated, improved and rolled out
ONS’ data extract may therefore not reflect the current system accurately, and issues identified may
have already been addressed by BF. For this reason, ONS used the data extract for data
familiarisation and basic data quality checks, but has maintained communications with BF to
ensure our understanding and recommendations are up to date. The ‘Where we are now’ sections
of the report address these discrepancies and use current BF information to build a more up-to-date
picture of Semaphore as it is now.
The distinction between statistical quality and operational quality of the data
BF’s operational requirements may be very different to ONS’ statistical requirements and features of
the data that present statistical challenges may actually benefit operations. For example, multiple
entries for the same person making the same journey may be helpful for the purposes of identifying
individuals of interest for security reasons. However, they present a statistical challenge because
accurate statistics will depend on removing duplicate records.
Data security challenges
ONS have rigorous processes and procedures in place to ensure that data are protected by
appropriate privacy and security safeguards. In order to comply with principles set out in the Data
Protection Act 1998, ONS only requested variables identified as necessary for the research (Annex
A). The Semaphore data are held within a Statistical Research Environment (SRE). The SRE is ONS’
administrative data environment which has been specifically designed to address privacy and
security concerns that could arise in terms of research with personal data. As a result researchers do
not have access to the identifiers associated with personal data. In the case of Semaphore data,
names, dates of birth and travel document numbers are pseudonymised (see Annex A). This
pseudonymisation presents some research challenges and indicates the need for further
collaborative working with BF on processing the data.
2. Data quality
An ideal system to measure migration accurately would include:
• one unique record for each person on a journey (no duplication)
• total coverage of all routes into and out of the UK
• accurate, complete and consistent information within each record to enable the same person to be linked across the dataset
• a way of collecting intentions data
Semaphore was never intended or designed to address these requirements, and it cannot collect
intentions information, but it has the potential to meet some of them to some degree. The Office
for National Statistics (ONS) has assessed the 2009-2012 Semaphore extract against these aspects of
data quality.
2.1 Duplication
The Semaphore system data is defined as a ‘transactional’ administrative data source: each travel
event is a ‘transaction’ and appears as a new entry on the system. The information is passed into
the Semaphore system at least twice during the travel process: 1) at check-in (this could be an
individual checking in online), and 2) at departure (either via electronic swiping of the document, or,
dependent on the location around the globe, could be typed in manually).
This design leads to multiple records for the same travel event and this duplication is an important
issue for ONS to resolve. For Border Force’s (BF) operational purposes, repeated downloads of data
are a benefit, as they provide more opportunities to identify people of interest. For statistical purposes,
duplicate entries need to be removed because including duplicated travel events could inflate any
figures and complicate any linkage methods. This is an example of a clear distinction between
assessing the data in terms of statistical versus operational quality. The Semaphore system has been
designed to meet its operational purpose of improving UK Border security and this will not always
include the features required to produce high quality statistical outputs.
Determining what was a unique record and what was an additional copy or duplicate in the 2009-2012
extract was a significant challenge because there was no unique identifier to link multiple
records for the same travel event together. A ‘de-duplication’ strategy was developed as follows:
Two types of repeated instances of a single travel event were defined:
• exact replicates: identical copies of a travel event, where all variables were the same as another record
• duplicates: non-identical copies of a travel event, where a number of defining variables were the same as another record
For duplicates the defining variables were a combination of:
• journey information: travel date and flight code (or time, for ferries)
• document information: document number and nationality
• person information: forename, surname and date of birth
There were two stages to choosing these defining variables. Firstly the variables with the lowest
level of missing and invalid data were identified. Then a logical exercise was conducted to assess
what combination of variables is required to identify a unique instance of a document being used to
undertake a journey by an individual. This is a relatively conservative strategy to ensure no genuine
records are removed in error, but it means it is likely that some duplicates were not removed. For
example, if one of the defining variables such as surname was incorrectly captured at check-in, then
later corrected via an additional record, the strategy employed will retain both records because
surname does not match.
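The two-step strategy described above can be illustrated with a short sketch. It is a simplified illustration, not the ONS implementation: the record layout, field names and sample values are hypothetical, and real processing would operate on pseudonymised values (equality comparisons work the same way on those).

```python
from collections import namedtuple

# Hypothetical record layout; event_type stands in for any non-defining variable.
Record = namedtuple(
    "Record",
    "travel_date flight_code doc_number nationality forename surname dob event_type",
)

# The defining variables listed in the report (journey, document, person).
DEFINING = ("travel_date", "flight_code", "doc_number", "nationality",
            "forename", "surname", "dob")


def deduplicate(records):
    """Drop exact replicates, then duplicates that share all defining variables.

    Returns (kept_records, exact_replicates_removed, duplicates_removed).
    The first record seen for each travel event is the one retained.
    """
    seen_records = set()
    seen_keys = set()
    kept = []
    exact_removed = 0
    duplicates_removed = 0
    for rec in records:
        if rec in seen_records:          # every variable identical
            exact_removed += 1
            continue
        seen_records.add(rec)
        key = tuple(getattr(rec, name) for name in DEFINING)
        if key in seen_keys:             # same defining variables only
            duplicates_removed += 1
            continue
        seen_keys.add(key)
        kept.append(rec)
    return kept, exact_removed, duplicates_removed


# Invented example: one journey submitted four times.
check_in = Record("2012-05-01", "XX123", "D0001", "GBR",
                  "JOHN", "SMITH", "1970-01-01", "check-in")
replicate = Record(*check_in)                               # exact copy
departure = check_in._replace(event_type="departure")       # duplicate
misspelt = departure._replace(surname="SMYTH")              # not matched

kept, exact_removed, duplicates_removed = deduplicate(
    [check_in, replicate, departure, misspelt]
)
```

Here the exact copy and the departure record are removed, but the record with the misspelt surname is retained because one defining variable differs. This mirrors the conservative behaviour described above, where some true duplicates survive so that no genuine record is dropped.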
2.1.1 Where we were
The ‘de-duplication’ strategy removed both exact replicates and duplicates, and resulted in the
removal of approximately 40% of records on average across the dataset. In terms of quantity this
reduced the 850 million records provided to ONS to 509 million records.
2.1.2 Where we are now
Event type⁴ is a field that only began to be consistently completed from October 2011, some way
through the extract period, and this variable will be important in better understanding and resolving
duplication. It is accepted and understood that the Semaphore system operates for a specific
purpose and was not initially intended to provide statistical information. BF have told us that
duplicates in API data are an expectation of the Semaphore system as the system allows multiple
drops of data.
Continued BF and ONS collaboration will prioritise monitoring the level of duplicates in future
extracts provided by BF and working together in order to inform future ONS de-duplication
processes, for example exploring BF de-duplication strategies and whether ONS can use these.
⁴ Event type or Event code is a variable within Semaphore which distinguishes at which stage of the process
the information is entered, for example check-in or departure.
2.2 Coverage
Coverage refers to how many entries into and exits from the UK are captured by the Semaphore
system. The statistical requirements of a migration counting system would be that all instances of
entry into, and exit from, the UK are captured, or where they are not, it must be possible to adjust
for those missed. Otherwise some migrations could be missed altogether, and other travel events
could be wrongly identified as migrations. For example, if an individual travels to the UK, then leaves
via a route or method that is not covered within 12 months, they may be wrongly identified as a
long-term international immigrant.
2.2.1 Where we were
Based on the de-duplicated 2009-2012 extract of Semaphore data, we independently estimate that
coverage increased over time, reaching 69% in 2012, when compared with Civil Aviation Authority
(CAA) data. This is an estimated figure and there are reasons why it should be treated with caution.
Our de-duplication strategy means that there are still likely to be duplicates of the same travel event
in cases where one or more of the defining variables change between check-in and departure
confirmation, potentially inflating the coverage estimate.
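As a rough illustration of why residual duplicates matter, coverage can be thought of as the de-duplicated Semaphore passenger count divided by the comparator (CAA) count. The sketch below uses invented counts and an invented residual duplicate share; neither figure comes from the extract, and the function names are hypothetical.

```python
def coverage(semaphore_count, comparator_count):
    """Share of comparator passenger journeys that appear in Semaphore."""
    if comparator_count <= 0:
        raise ValueError("comparator_count must be positive")
    return semaphore_count / comparator_count


def adjusted_coverage(semaphore_count, comparator_count, residual_duplicate_share):
    """Coverage after discounting an assumed share of undetected duplicates."""
    genuine = semaphore_count * (1 - residual_duplicate_share)
    return coverage(genuine, comparator_count)


# Invented counts (millions of journeys) and an assumed 5% residual duplication.
naive = coverage(690, 1000)
adjusted = adjusted_coverage(690, 1000, 0.05)
```

With these illustrative inputs the adjusted figure is lower than the naive one, showing the direction of the bias: any duplicates left in the numerator overstate coverage.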
Civil Aviation Authority (CAA) data only includes air travel, but this is comparable to ONS’ extract as
the 2009-2012 extract only included a very small number of non-air traffic records.⁵ Future ONS
extracts will include ferry and rail traffic and therefore other comparators may be needed, for
example the International Passenger Survey (IPS).
BF estimates that Advance Passenger Information (API) coverage increased from 57% in 2009 to 67% in
2012. In other words, BF’s estimate of coverage in 2012 is broadly similar to the coverage
independently estimated by ONS. Even 69% coverage is too low for most statistical
purposes, and the 2009-2012 data cannot be used to produce estimates of international migration.
2.2.2 Where we are now
BF state that API coverage (the estimated proportion of passengers who travel to and from the UK
on routes connected to Semaphore) is approximately 80% in 2014. This is a significant improvement
on the position at the end of 2009, where API coverage was just under 60%. Commercial air,
maritime and rail coverage now stands at approximately 95%, 20% and 0% respectively. By the end
of March 2015, it is envisaged that overall coverage will increase to just below 85%, with maritime
rising to approximately 50%.
Levels of coverage will benefit further from the Exit Checks programme that is due to go live in April
2015. Exit Checks will build on existing arrangements where possible, including using API provided by
carriers, supplemented with checks conducted at the border on departure where necessary.
BF’s estimates of coverage show improvement over time and there is optimism that
coverage will continue to increase, but without near complete coverage ONS will not be able to
produce direct counts of migration. However, even with incomplete coverage Semaphore
continues to be of interest to ONS and other uses are being explored. These are discussed in more
detail in section 3. As coverage continues to improve, the statistical usefulness of the data will
increase and ONS will work with BF to keep up-to-date with developments.
⁵ There is a very small proportion of ferry traffic and no rail traffic in this extract.
2.3 Completeness
In addition to coverage, within-record variable completeness is vital to understanding the statistical
quality of Semaphore data. If all travel events are captured, but individual data items important for
statistical production are missing or invalid, data utility is reduced. There are operational reasons
why all fields are not recorded for every record but from an ONS perspective, where data are
missing or invalid, the ability to accurately link records for the same individual over time may
decrease. A number of demographic variables in addition to travel document number may be
important for data linkage, for example name, date of birth and sex.
2.3.1 Where we were
Analysis of the extract supplied to ONS showed that approximately 96% of records have complete
and valid demographic information, including date of birth, name, sex and nationality. It could be
that some of the remaining 4% are duplicates that would be removed if a stricter de-duplication
strategy were possible without losing any genuine records.
Within the ONS extract there was no consistent unique variable to link multiple records (for example
check-in and departure records) for the same travel event. However, initial analysis strongly
suggested that where there was a record with missing data at check-in, the departure confirmation
record contained the demographic information and vice versa. Hence, it is believed that the
development of a stricter de-duplication strategy will also need to consider retaining the
demographic data from whichever of the two records is more complete.
Only small proportions of invalid data were found (less than 2% of records contained some invalid
data). More detailed investigation showed that in the majority of cases the invalid data related to a
different variable, for example a date of birth appearing in a name field. This may be caused by
either input error or as a result of the technical data extraction and transfer process.
2.3.2 Where we are now
BF has provided explanations for the incompleteness of some of the variables within the 2009-2012
extract. It is possible for carriers to submit information for passengers several times and from an
operational point of view, as long as all the information has been submitted at least once, this is
acceptable. With the ‘event type’ variable now consistently available on the Semaphore system, it
will be easier to link between multiple records to create a single record per travel event.
Event type was introduced in 2011. It is therefore available on only part of the 2009-2012 extract. BF
estimate that record linking using event type with 2014 data results in less missing demographic
data. If future ONS extracts consistently contain the event type variable, we expect that creating a
single record that includes the best combination of completed fields, where there is more than one
record per travel event, will be easier. In addition, BF have implemented mechanisms to improve
data completeness in the areas identified by ONS.
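The idea of building a single record from the best combination of completed fields can be sketched as follows. This is an assumption-laden illustration rather than BF's or ONS' actual process: it presumes the records for one travel event have already been grouped (for example using event type together with journey details), and the field names and values are invented.

```python
MISSING = (None, "")


def merge_event_records(records, fields):
    """Build one record per travel event from several partial records.

    For each field, the first non-missing value is used, so the order of
    `records` decides which submission wins when values disagree.
    """
    merged = {}
    for field in fields:
        merged[field] = next(
            (rec[field] for rec in records if rec.get(field) not in MISSING),
            None,
        )
    return merged


# Invented check-in and departure records for the same travel event.
FIELDS = ("forename", "surname", "dob", "sex")
check_in = {"forename": "ANNA", "surname": "", "dob": "1980-02-03", "sex": None}
departure = {"forename": "ANNA", "surname": "JONES", "dob": None, "sex": "F"}

combined = merge_event_records([departure, check_in], FIELDS)
```

Listing the departure record first is a design choice made here for illustration, on the assumption that information captured later in the travel process is more likely to be corrected. Each partial record on its own would count as incomplete, while the combined record is complete.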
2.4 Data consistency
Data consistency refers to the extent an individual’s travel details contain the same information
every time their travel information is entered onto Semaphore. There are some cases where details
may change over time, reflecting individual circumstances, for example:
• using different documents (for example a passport on the way out of the UK and an EU identity card on the way back in, or travelling on different passports if the person has dual nationality)
• renewing an out-of-date passport (in the UK, every 10 years for adults and every 5 years for children)
• changing name (perhaps through a change in marital status)
• replacing a lost or stolen document
Changes in document numbers over time mean it is likely that demographic variables will also be
required in any data linkage method. However, if there are changes to demographic variables as well
(for example a marriage leading to both a change in name and passport number), it would be more
difficult to link the data well enough to create travel histories.
2.4.1 Where we were
It is normal and, in fact, expected to find some inconsistencies and unexpected characters in
administrative data, especially in free text fields such as the name variables. The name variables on
Semaphore are considered important for data linkage so much of our initial work on consistency
focussed on name variables.
Focus on name variables
To allow researchers to view real data from the name variables (that had not been pseudonymised
but still met data security controls), a list of unique names present in the data was produced. All
that could be deduced was that each name had appeared at least once. This process was repeated
separately for each of the three name variables (forename, surname, and middle names) so
there was no way to link the three individual lists together. In addition, none of the examples used in this
section are real, but they accurately reflect issues that were found in the data.
There were four main issues ONS found with the names variables within the extract that may affect
the ability to link records for the same individual over time:
1. Number of words in name fields: Whilst it was expected that some name fields would
contain more than one word it was expected that the majority of names would consist of a
single word. However, only in surname was the modal value one word as expected; in
middle name and forename the modal value was two words. Clerical review of the data
showed instances where it is likely a person’s full name appeared in one field rather than
spread across all three name variables, although in some cases it appeared to be caused by a
number of initials.
2. Unexpected characters: The International Civil Aviation Organization (ICAO) document 9303,
Machine Readable Travel Documents states that the only symbol, other than numbers or
letters from the Roman alphabet, that should be on the machine readable zone of a
passport is ‘<’. This is used as a separator and filler character. However, some of the data in
Semaphore is manually entered and several examples of other symbols were found; for
example ‘ARMS*RONG’ where an asterisk replaced a letter, or ‘5TEVE’ where a letter has
been entered or read as a number.
3. Generational suffixes: A number of entries were found with what appear to be
generational suffixes, for instance JOHN DOE II. The existence of generational suffixes is not
unexpected; however, if they are inconsistently applied over time they present a challenge for
data linkage of an individual’s travel events.
4. Titles: Titles and qualifications should not be on the travel document machine readable
zone; however, a number of titles were found in the names fields, including Mrs, Mr, Miss,
Ms, Dr and PhD. As with generational suffixes, if titles are inconsistently used (sometimes
present, sometimes not), they present challenges should the names variables be required to
achieve accurate data linkage.
2.4.2 Where we are now
Operationally, all the information that BF require from the Semaphore system is captured. In terms
of names, BF have sophisticated data tools to clean, categorise and link. However, high levels of
consistency in the name fields are crucial if they are to be used in ONS data linkage processes. There
are three main ways in which the quality of the names variables could be optimised and these are all
being progressed:
• feeding findings back to BF to help improve the data; BF are working towards improving the demographic data at source via the introduction of Key Performance Indicators (KPIs) for operators
• developing better ways to process the name variables after they have been received from BF to maximise consistency over time, for example considering whether all generational suffixes could be removed in case they are inconsistently applied
• working with BF to understand the processes, rules and systems they have in place for processing name variables to determine if these might meet statistical needs too
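One possible form of the second option, post-receipt processing of name variables, is sketched below. It is a hypothetical illustration: the title and suffix lists, the function names and the character rule (letters plus the ‘<’ filler) are assumptions made here, not rules used by ONS or BF. Because researchers only see pseudonymised names, cleaning of this kind would presumably need to be applied before pseudonymisation.

```python
import re

# Assumed lists, based on the examples given in the report.
TITLES = {"MR", "MRS", "MISS", "MS", "DR", "PHD"}
SUFFIXES = {"JR", "SR", "II", "III", "IV"}

# Assumption: a name field should hold only letters, spaces and the '<' filler.
UNEXPECTED = re.compile(r"[^A-Z< ]")


def normalise_name(raw):
    """Upper-case a name and drop titles and generational suffixes."""
    text = raw.upper().replace("<", " ")
    tokens = [token.strip(".,") for token in text.split()]
    kept = [t for t in tokens if t and t not in TITLES and t not in SUFFIXES]
    return " ".join(kept)


def has_unexpected_characters(raw):
    """Flag digits or symbols that suggest a keying or reading error."""
    return bool(UNEXPECTED.search(raw.upper()))


# Invented examples mirroring the issues described in section 2.4.1.
cleaned = normalise_name("Dr John Doe II")
flagged = [name for name in ("ARMS*RONG", "5TEVE", "ARMSTRONG")
           if has_unexpected_characters(name)]
```

Removing every suffix token makes names consistent over time at the cost of occasionally discarding a genuine name part, and the character rule would also flag hyphens and apostrophes in manually entered names, so both choices would need testing against real data.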
2.5 Conclusion on Data Quality
ONS previously suggested that, in theory, Semaphore could be used to measure migration directly
but the UK Statistics Authority⁶ and the government⁷ have provided reasons why this is not currently
possible. Our analysis of the Semaphore 2009-2012 extract, and information from BF about
improvements since 2012, support this view. However, the known statistical data quality issues of
duplication, coverage, within-record variable completeness and consistency do not prevent
Semaphore having some immediate utility, and ONS have been keen to exploit this. The next
section looks at what ONS has been able to achieve so far.
⁶ http://www.statisticsauthority.gov.uk/reports---correspondence/correspondence/letter-from-sir-andrew-dilnot-to-bernard-jenkin-mpuksa-response-hc-523-06122013.pdf
⁷ http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/1228/122804.html
3. What has ONS been able to use Semaphore for so far?
High-level aggregate counts of passenger flows from the 2009-2012 Semaphore data extract are
being used by an ongoing International Passenger Survey (IPS) Sample Review. Semaphore data are
particularly relevant to Port Coverage, Sample Optimisation and Out of Hours Air Traffic Coverage. In
this context, while duplication may exist, if these duplicates are evenly distributed, they will not
affect overall aggregate patterns or proportions. In addition, coverage and completeness at certain
ports of interest may be sufficient for the needs of the IPS Sample Review and since there is not a
requirement to link passengers over time for this, inconsistency in how the same individual’s
information is captured between different journeys does not necessarily affect the utility of the flow
data.
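The point about evenly distributed duplicates can be illustrated with a minimal sketch. The port names, counts and duplication rates below are invented for illustration and are not Semaphore figures. The sketch assumes one duplicate for every ten movements at every port in the even case, and duplication at a single port in the uneven case.

```python
def shares(counts):
    """Return each port's share of the total number of recorded movements."""
    total = sum(counts.values())
    return {port: n / total for port, n in counts.items()}

# Hypothetical true passenger counts by port (illustrative only).
true_counts = {"Port A": 6000, "Port B": 3000, "Port C": 1000}

# Even duplication: one extra record per ten movements at every port.
uniform = {port: n + n // 10 for port, n in true_counts.items()}

# Uneven duplication: only Port C is affected (one extra record per two movements).
uneven = dict(true_counts)
uneven["Port C"] = true_counts["Port C"] + true_counts["Port C"] // 2

true_shares = shares(true_counts)
uniform_shares = shares(uniform)
uneven_shares = shares(uneven)

for port in true_counts:
    print(port, round(true_shares[port], 3),
          round(uniform_shares[port], 3), round(uneven_shares[port], 3))
```

Under even duplication the port shares match the true shares, while duplication concentrated at one port overstates that port's share. Whether duplicates in the real extract are evenly distributed is an empirical question that this sketch does not answer.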
3.1 Background on IPS Sample review
The IPS collects information relating to travel and tourism between the UK and other countries, and
screens a larger sample of passengers entering and exiting the UK to identify migrants. A sample of
approximately 250,000 passengers is achieved per year for the travel and tourism sample and
approximately 800,000 people are interviewed as part of the migrant sift (around 4,000-5,000 of
these are migrants), using a sampling frame that covers 99% of the ports/routes into and out of the
UK. Due to out of hours travel (passengers travelling at times when the IPS is not in operation) and
through ports not included in the sampling frame, the IPS sample is representative of approximately
95% of all passenger traffic to and from the UK with the “missing” 5% accounted for within the
weighting process.
A full-scale review of the IPS sample is being carried out to ensure the IPS sample continues to be
representative of UK passenger traffic and Semaphore offers a potentially valuable additional source
to feed into the review.
3.2 Review of Port Coverage
Semaphore data including aggregate flows of passengers by port and nationality are being used to
help assess whether any of the ports currently excluded from the IPS sample are potentially
important migration routes, or tend to attract overseas tourists from particular regions likely to be
of interest to IPS users such as VisitBritain.
3.3 Sample Optimisation
A sample optimisation for the IPS is designed to assess whether the overall sample profile needs to
be updated in order to ensure the quality of the migration and tourism estimates is maintained.
Semaphore data including flows of passengers by port, nationality and shift – morning, afternoon or
night – are being used in conjunction with other sources to identify the most effective combination of
shifts at all ports. This would provide robust data against the IPS’s three core information areas:
migration; earnings and expenditure information; and travel and tourism details. Extra information
provided by Semaphore may help ensure the sample design meets with stakeholders’ data quality
requirements and also that the optimal sample design is cost effective and takes interviewer
resource and operational constraints into consideration.
3.4 Review of Out of Hours Air Traffic Coverage
Currently the IPS covers approximately sixteen hours of passenger traffic over an ‘am’ and ‘pm’ shift,
typically 6:00am to 10:00pm at airports. Passengers arriving or departing on flights outside these
hours are not sampled, which is accounted for in the weighting system. Because a number
of flights are not covered, there are concerns that the IPS may not account for different types of
flights and passengers, and there may be considerable variations between airports/terminals and by
direction and quarter of the year.
Semaphore data including passenger flows by port, nationality and by arrival and departure time by
hourly slots will help inform or challenge the current assumption that passengers on certain routes
travelling in-hours are similar to those travelling on the same routes out-of-hours.
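One way such a comparison could be framed is sketched below. This is an assumption about method, not a description of the analysis ONS is carrying out. It compares the nationality mix of in-hours and out-of-hours movements using total variation distance (0 means identical mixes, 1 means no overlap). The records, nationality codes and the 06:00 to 22:00 window are illustrative.

```python
from collections import Counter

IN_HOURS = range(6, 22)  # 06:00 to 21:59, an assumed reading of the IPS shift window

def nationality_shares(movements, in_hours):
    """Share of movements by nationality, for in-hours or out-of-hours travel."""
    counts = Counter(nat for hour, nat in movements
                     if (hour in IN_HOURS) == in_hours)
    total = sum(counts.values())
    return {nat: n / total for nat, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical (hour of travel, nationality) records for one route.
movements = [(8, "GB"), (9, "GB"), (14, "FR"), (20, "US"),
             (23, "GB"), (2, "US"), (4, "US"), (5, "US")]

in_mix = nationality_shares(movements, in_hours=True)
out_mix = nationality_shares(movements, in_hours=False)
distance = total_variation(in_mix, out_mix)
print(in_mix, out_mix, distance)
```

A large distance on a given route would challenge the assumption that in-hours passengers resemble out-of-hours passengers; a small one would support it.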
3.5 The IPS Sample review and optimisation 2014
This section has briefly outlined some of the ways in which Semaphore may help with the IPS Sample
review. The outcomes of the wider IPS Sample review project will be reported on separately through
the IPS Sample Review and Optimisation Project Board. The timetable of the Sample Review is
currently under review, but it is envisaged that the Sample Optimisation will be completed for
implementation in January 2016.
4. How can Semaphore improve migration statistics and what
improvements in data quality are required to achieve this
The Office for National Statistics (ONS) agrees with the UK Statistics Authority and government
position that Semaphore cannot be used to directly measure migration in its current form.
However, our work with the 2009-2012 Semaphore data extract, and collaboration with Border
Force (BF), have helped develop our understanding of what may be possible as the statistical quality
of the Semaphore data improves. The potential uses below are presented in order of how soon it is
likely any benefit could be realised.
4.1 Improving the International Passenger Survey
Section 3 explained how ONS is already using Semaphore data as part of ongoing International
Passenger Survey (IPS) improvement work. This will continue, using both the data we have, and
more up-to-date Semaphore extracts as we complete work with BF to acquire these (see section 5).
The main data quality requirement in this area is that coverage continues to increase. In the 2009-2012
extract, there is a lot of variation in coverage by port which limits how useful the data can be.
Current levels of duplication and within record completeness may not be a barrier in this instance.
Data consistency is also less important in this context because accurate linking of an individual’s
travel over time is not required.
It is worth reiterating that Semaphore cannot completely replace the IPS which collects different
pieces of information that are required by users of migration statistics. The IPS includes information
on reason for stay and also intended length of stay. While intended length of stay may not be
realised in subsequent actual behaviour, affecting accuracy, it is timelier because we don’t need to
wait and observe the behaviour.
Using Semaphore data in conjunction with the IPS is the most likely short term way in which
Semaphore data could be used to improve migration statistics.
4.2 Combining the data with other administrative sources
Across all topics, ONS is looking to increase its use of administrative data - that is data already
collected, often by government, for other reasons – in statistical production. As ONS has shown for
Semaphore data, there are often limitations in the statistical utility of an administrative dataset
precisely because the data are not collected for statistical purposes. However, by combining
multiple administrative sources that have different strengths and weaknesses, these limitations can
often be mitigated or overcome.
4.2.1 Using administrative data to measure the population, particularly at the local level
From a population statistics perspective, the ONS Beyond 2011 programme has been assessing the
different possible approaches to producing population and housing statistics in future.
Administrative sources currently being researched by the programme include those held by the
Department of Health, the Department for Work and Pensions, HM Revenue and Customs, the
Department for Education and the Higher Education Statistics Agency.8
As a result of the programme’s research, in March 2014, the National Statistician recommended a
predominantly online census in 2021 supplemented by the further use of administrative and survey
data. This approach has been approved by government.
As research using administrative data to measure the population continues, how immigrants first
appear on administrative sources, and if and how emigrants are removed from those sources, will be
important. For example, in terms of immigration, when a patient registers with a new GP for the
first time, they are asked whether they are from abroad and when they first arrived in the UK.
Capturing emigration is likely to be more challenging but inactivity on employment related
databases may be one possibility. Semaphore may have a role to play in quality assuring findings
from other administrative sources, for example checking whether an individual identified as a
potential immigrant or emigrant on another source, has also been recently recorded entering or
leaving the UK by Semaphore.
For Semaphore data to be used in this way, coverage will be important but not necessarily crucial
because the other sources may cover individuals missing from Semaphore. Duplication for an
individual journey may also not affect the utility of the data if the initial migrant identification is
driven by another source where duplication has been easier to resolve. However, complete and
accurate data within a record will be important because accurate linkage of individuals across
datasets would be required.
8 A full list can be found in Annex C of this report: ‘Beyond 2011: Safeguarding Data for Research: Our Policy’, July 2013
4.2.2 Using administrative data to measure international migration, particularly at the
national level
The initial focus of the Beyond 2011 programme research has been investigating ways to estimate
the population by age and sex for sub-national areas (e.g. local authorities). Confirming whether a
record (for a person at an address) on an individual administrative source should contribute to a
population estimate for a given area will be an important part of ongoing research. If those
migrating internationally can be distinguished from those migrating elsewhere within the UK, it may
lead to statistics about flows of people as well as about people in a particular area.
However, setting out to better measure international migration specifically, particularly at a national
level, may lead to a different methodology or combination of sources. To explore this fully and
deliver results quickly would require a separate programme of work. Semaphore data would be an
important source in such research.
An important measure of success for any method developed would be whether it can deliver more
accurate estimates than the IPS. It is worth noting that under current immigration policy, finding a
combination of administrative sources that achieves this will be more realistic for non-EU migration
than EU migration. Visa and other data are generated for non-EU migrants by mandatory
interactions with the immigration system and these data may be required to account for limitations
in other administrative sources such as Semaphore.
Even in the case of non-EU migrants, a high match rate between Semaphore and visa data would be
required to confirm entry and exit dates of those with a visa. Semaphore would need complete or
near complete coverage, and a high level of within record completeness and accuracy to maximise
the chances of success.
4.3 Estimating the population present in the UK
If all flows in and out of the UK were accurately recorded by key characteristics such as nationality,
age and gender, this information would provide insight into the population present in the UK. This
would provide an indirect picture of migration even if individual travel histories were not linked over
time.
A count of arrivals and departures, and therefore net in/outflow by nationality, will include a large
number of visitors, and a much smaller number of short and long-term migrants. However, visitors
will leave after a short stay. Therefore, rolling counts starting from any given reference point could
potentially provide information on any long-term migration trends via discrepancies between in and
out flows. Such an approach would compare two enormous flows of close to 100 million movements
in and out of the UK per year and even a very small degree of inaccuracy in such large counts could
magnify the net effect implied by differences between such large underlying numbers. For example,
a bias in one flow of just 0.1% would add 100,000 to the annual difference between flows.
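The arithmetic behind this example can be sketched as follows. The sketch assumes that "close to 100 million" refers to each direction of travel, which is how the 100,000 figure in the text works out. The function name is hypothetical.

```python
def spurious_net_flow(movements_per_direction, relative_bias):
    """Net difference created purely by a relative bias in one of the two flows."""
    return round(movements_per_direction * relative_bias)

annual_movements = 100_000_000  # assumed annual movements in each direction

for bias in (0.0001, 0.001, 0.01):
    print(f"bias of {bias:.2%}: spurious net flow of "
          f"{spurious_net_flow(annual_movements, bias):,}")
```

Even the smallest bias shown produces a spurious net flow in the thousands, which is why very high accuracy would be needed before differences between the two flows could be read as migration.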
Although less ambitious than travel histories, statistics of this nature would require complete or near
complete coverage, very low levels of duplication, and complete and accurate data for the
characteristics of interest (for example, nationality). This is because on average, approximately 1 in
every 200 passenger movements is a migration event, so any differences between in and out flows
caused by limitations or statistical bias in the data (for example, a port that is more common for
departures than arrivals not being covered) could mask any real world differences.
4.4 Travel histories
If travel into and out of the UK were accurately recorded once per person per journey, it would in
theory be possible to link people’s travel patterns over time. For a proportion of individuals, this will
already be possible. However, there are many barriers to translating these travel histories into an
estimate of migration, especially one we could be confident was more accurate than the IPS.
Estimation of international migration would depend on identification of individuals where there is a
record of entry but not departure (immigration), or a record of departure but not return
(emigration), within a specified period (12 months for long-term migration estimates according to
the UN definition used by ONS). As described in section 4.3, approximately 1 in 200 passenger
movements is a migration event. If even a small proportion of the other 199 in 200 movements
were misclassified as a migration event, any migration estimate based on such a
method is unlikely to be more precise than the IPS. This misclassification could happen in a number
of ways. Taking immigration as an example:
• without complete coverage, an individual may enter via a route that is covered but leave via
a route that is not, inflating any estimate of immigration
• if there were two or more records for an individual entering the UK but only one when they
left, and the duplicates could not be easily identified (or all paired with the exit record), the
additional entry record could be taken as evidence of additional immigration (in error)
• if there was a failure of the longitudinal linking of an individual’s travel data in creating their
travel history, such that their entry and exit records were not paired together, their entry
record could inflate any immigration estimate. It is noted in this case there would also be an
incorrect but compensating inflation of any emigration estimate caused by their unpaired
exit record. Because individual document numbers can change (for example, when a
passport is renewed, at which point a name could also have changed, e.g. following marriage), and dual
nationals can use different passports on entry and exit, high levels of completeness,
accuracy and consistency over time of demographic information would be required to link
records reliably
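The three failure modes above can be illustrated with a minimal sketch of entry and exit pairing. This is not ONS's linkage method. The person keys and dates are hypothetical, each exit is paired with at most one entry, and the 12-month window is simplified to 365 days.

```python
from datetime import date, timedelta

def count_unmatched_entries(events, window_days=365):
    """Count entry records with no exit by the same key within the window.

    events: iterable of (person_key, direction, travel_date) tuples, where
    direction is "entry" or "exit". Unmatched entries are what a naive
    method would treat as immigration.
    """
    exits_by_key = {}
    for key, direction, when in events:
        if direction == "exit":
            exits_by_key.setdefault(key, []).append(when)

    unmatched = 0
    entries = sorted((when, key) for key, direction, when in events
                     if direction == "entry")
    for when, key in entries:
        deadline = when + timedelta(days=window_days)
        later_exits = sorted(d for d in exits_by_key.get(key, [])
                             if when <= d <= deadline)
        if later_exits:
            exits_by_key[key].remove(later_exits[0])  # pair with the earliest exit
        else:
            unmatched += 1
    return unmatched

# A two-week visit, correctly recorded once in each direction.
clean = [("k1", "entry", date(2012, 1, 10)), ("k1", "exit", date(2012, 1, 24))]
# The same visit with the entry recorded twice.
duplicated_entry = clean + [("k1", "entry", date(2012, 1, 10))]
# A visitor who left via a route that is not covered.
uncovered_exit = [("k2", "entry", date(2012, 3, 1))]
# A visitor whose key differs between entry and exit (e.g. a different passport).
changed_document = [("k3-old", "entry", date(2012, 5, 1)),
                    ("k3-new", "exit", date(2012, 5, 15))]

for name, events in [("clean", clean), ("duplicated entry", duplicated_entry),
                     ("uncovered exit", uncovered_exit),
                     ("changed document", changed_document)]:
    print(name, count_unmatched_entries(events))
```

In each of the three problem cases a short-term visitor is counted as one apparent immigrant, even though no migration took place.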
There would also be an issue over timeliness to overcome because the above process relies on
assessing an individual’s actual behaviour during the 12 months after they enter / leave the UK (for
long-term migration estimates), not their intentions. The estimate for any given reference period
would require some form of provisional estimation to prevent a lag of over 12 months between the
end of the reference period and release of the estimates. These provisional estimates would then
be revised as more data about actual behaviour since the reference period became available. At
present, ONS publish migration statistics with a 5-month lag between the reference period and the
publication date.
4.5 Potential uses not directly related to migration
Short term: In the IPS travel and tourism sample, there is a higher number of UK residents and
overseas residents starting their visit than ending their visit. This is important because the IPS
overseas travel and tourism estimates are only based on the number of completed visits (i.e. they
are calculated from data collected only from interviews with UK residents as they arrive back in the
UK and with overseas residents as they depart the UK at the end of their visits). Imbalance may
result in visits and spending being under-estimated. Semaphore data can provide passenger flows by
arrival and departure airport and nationality so may help inform whether the current weighting
applied to correct this imbalance needs adjustment.
Long term: Reliable total flows by important characteristics such as nationality and port of entry (as
per section 4.3) may allow new statistics about travel and tourism. In this case, because separating
out the small number of migrants from everyone else is less important, the statistics would be more
robust to any error caused by unresolved duplication or incomplete records. However, improving
coverage would still be important because different nationalities will visit the UK via different
methods / ports, so any gaps in coverage will lead to bias in the data.
5. Next Steps
More research is planned by the Office for National Statistics (ONS) to ensure potential benefits are
explored and realised as soon as the Semaphore data are of sufficient statistical quality. Specifically:
5.1 Future Semaphore data extracts
ONS have begun the process to request more frequent, smaller extracts of Semaphore data. We will
use these extracts in combination with regular advice from Border Force (BF) colleagues to monitor
data quality and determine when to research any given potential improvement further.
5.2 Regular communications with BF officials
ONS have benefitted from working closely with BF to ensure we have a firm understanding of the
data system and processes. BF officials better understand how changes to collection and processing
may affect, and hopefully improve the statistical utility of the data. ONS and BF will establish a
regular working group to formalise this information sharing process, ensuring statistical benefits are
considered alongside operational benefits as the Semaphore system develops.
5.3 Refining ONS’ de-duplication strategy
Work is planned to create a more robust de-duplication strategy guided by BF advice. This may also
improve other indicators of statistical quality. For example, where two records are incomplete, it
may be because they relate to the same person on the same journey and are missing different
pieces of information in each case. Once this duplication is resolved, and the records combined,
instances of incomplete fields will decrease.
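The idea of combining two incomplete records can be sketched as below. This is an illustrative assumption about how such a merge might work, not the planned ONS strategy. The field names and values are hypothetical, and the sketch refuses to merge when two populated fields disagree.

```python
def merge_duplicates(first, second):
    """Combine two records believed to describe the same person and journey.

    Missing values are represented by None. Returns None if any field is
    populated in both records with different values, since the records may
    then not be true duplicates.
    """
    merged = {}
    for field in first.keys() | second.keys():
        a, b = first.get(field), second.get(field)
        if a is not None and b is not None and a != b:
            return None
        merged[field] = a if a is not None else b
    return merged

# Two hypothetical records for the same journey, each missing a different field.
record_a = {"forename": "ANNA", "surname": "SMITH",
            "date_of_birth": None, "nationality": "GBR"}
record_b = {"forename": "ANNA", "surname": None,
            "date_of_birth": "1980-05-17", "nationality": "GBR"}

combined = merge_duplicates(record_a, record_b)
print(combined)
```

Here two incomplete records become one complete record, so resolving the duplicate also reduces the count of incomplete fields.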
5.4 Regular user updates
ONS will publish regular updates and users will continue to be updated via events such as the annual
Migration Statistics User Forum conference. At the most recent MSUF conference (September 2014),
it was evident that users are still interested in the outcome of this work.
The planned work above will address the following research questions:
1. How has data quality improved as new data extracts are received and analysed?
2. How useful has the Semaphore data been in the ongoing IPS improvement work? What have
been the limitations? Can this information be used to be more specific about the quality
improvements required to allow Semaphore to be permanently integrated into an IPS
optimisation process?
3. Increased coverage is fundamental in facilitating all the potential benefits and this can only
be increased by BF. In the meantime, can BF and ONS develop techniques that maximise the
other aspects of statistical quality?
ONS remains committed to improving population and migration statistics and Semaphore is just one
area being investigated. The ONS Population Statistics Research Unit (PSRU) continues to explore
several other improvements and regular updates on these are published on the ONS website. Any
questions on this report should be directed to the PSRU inbox.
Annex A – Semaphore variables received
Summary of Semaphore variables received, whether they are pseudonymised by ONS, and the reasons they are included in the extract
Document related
variables
Person related variables
Forename
Yes
X
X
Middle names
Yes
X
X
Surname
Yes
X
X
Date of birth
Yes
X
X
X
X
X
X
Sex
Nationality
Document number
Other
Reason included in extract
Key output
variable
Pseudonymised
by ONS
Facilitate
entry and
exit linkage
Person
centric
travel
histories
Variable name
X
Yes
X
Document type
X
Country of issue
X
X
Issue date
X
Expiry date
X
Flight related variables
Other
Key output
variable
Person
centric
travel
histories
Reason included in extract
Carrier
Validate route information, investigate quality by carrier
Flight code
Validate route information
Arrival date and time
X
X
Date sequencing
Departure date and time
X
X
Date sequencing
Arrival airport
X
Departure airport
X
Embark airport
X
Disembark airport
X
OPI (booking) code
Other
Pseudonymised
by ONS
Facilitate
entry and
exit linkage
Variable name
X
X
Linking individuals travelling together
Passenger type
Remove flight crew
Event type
Ascertain data download point
Glossary of terms
Advance Passenger Information (API)
Advance Passenger Information (API) is information from an individual’s passport
required by the government of the country being travelled to.
Border Force (BF)
Border Force is a law enforcement command within the Home Office. It secures the UK border
by carrying out immigration and customs controls for people and goods entering the UK.
Border Force was formed on 1 March 2012 as a law enforcement command within the Home
Office. Border Force secures the border and promotes national prosperity by facilitating the
legitimate movement of individuals and goods, whilst preventing those that would cause harm
from entering the UK. This is achieved through the immigration and customs checks carried out
by Border Force staff at ports and airports.
International Passenger Survey (IPS)
The International Passenger Survey (IPS) collects
information about passengers entering and
leaving the UK, and has been running
continuously since 1961. The IPS conducts
between 700,000 and 800,000 interviews a year
of which over 250,000 are used to produce
estimates of Overseas Travel and Tourism.
Interviews are carried out at all major airports
and sea routes, at Eurostar terminals and on
Eurotunnel shuttle trains.
The survey results are used by various
government departments, including the Office
for National Statistics (ONS), the Department for
Transport, the Home Office, HM Revenue and
Customs, VisitBritain and the national and
regional Tourist Boards.
The results are primarily used to:
• measure the impact of travel expenditure on the UK economy
• estimate the numbers and characteristics of migrants into and out of the UK
• provide information about international tourism and how it has changed over time
International Civil Aviation Organization (ICAO)
The International Civil Aviation Organization
(ICAO) is a UN specialized agency, created in
1944 upon the signing of the Convention on
International Civil Aviation (Chicago
Convention).
ICAO works with the Convention’s 191 Signatory
States and global industry and aviation
organizations to develop international Standards
and Recommended Practices (SARPs) which are
then used by States when they develop their
legally-binding national civil aviation regulations.
There are currently over 10,000 SARPs reflected
in the 19 Annexes to the Chicago Convention
which ICAO oversees, and it is through these
SARPs and ICAO’s complementary policy,
auditing and capacity-building efforts that
today’s global air transport network is able to
operate over 100,000 daily flights, safely,
efficiently and securely in every region of the
world.
Long-term International Migrant
‘A person who moves to a country other than
that of his or her usual residence for a period of
at least a year (12 months), so that the country
of destination effectively becomes his or her
new country of usual residence.’
- United Nations (1998)
Public Administration Select Committee (PASC)
The PASC examines the quality and standards of
administration within the Civil Service and
scrutinises the reports of the Parliamentary and
Health Service Ombudsman.
Travel event
A single travel event is a departure from, or an
arrival at, a UK travel port which is
recorded on the Semaphore system.
UK Statistics Authority
The UK Statistics Authority is an independent
body operating at arm's length from government
as a non-ministerial department, directly
accountable to Parliament. It was established on
1 April 2008 by the Statistics and Registration
Service Act 2007.
The Authority's statutory objective is to promote
and safeguard the production and publication of
official statistics that serve the public good. It is
also required to promote and safeguard the
quality and comprehensiveness of official
statistics, and ensure good practice in relation to
official statistics.
The UK Statistics Authority has two main
functions:
1. oversight of the Office for National
Statistics (ONS) - its executive office
2. independent scrutiny (monitoring and
assessment) of all official statistics
produced in the UK.
Membership of the Authority's Board comprises
the Chair of the Authority, seven other
non-executive members, and three executive
members.