Drowning in data: who and what can we trust?[1]
David Rhind, Chair of APPSI
Summary
This paper takes a wide view of the issues arising from the advent of ‘Big Data’. It distinguishes
information and data and separates data into three groups based on custodianship because these have
different implications for data exploitation. The three categories are Open Data (and the National
Information Infrastructure), personal data held by government and data held as a commercial asset.
The paper then addresses five issues: threats to privacy and the trade-offs involved with public
benefits, the appropriate role of the state, the consequences of technology failure, the dearth of UK
quantitative skills, and misunderstandings or misrepresentations of data. Central to minimising threats
to privacy and maintaining public trust are good governance, regulation and independent ‘fact
checking’. Finally, guidance on how to assess evidence or conclusions derived from data and
information is given.
1. From data famine to information feast?
In 1798 Thomas Malthus predicted population growth would increase faster than the availability of
food, with catastrophic consequences. But, 216 years later, Malthus’ apocalyptic prediction has not
yet been realized (though nearly one in eight of the world’s population are suffering from chronic
undernourishment[2]). The science of crop breeding, new technologies and improved management
practices have been the main contributors to averting the cataclysm.
In a later parallel, librarians argued in 1944 that the growing volume of publications would outstrip
their capacity to store it. This was based on the observation that US university libraries were then
doubling in size every 16 years. What has happened is that the data and information collected by
governments, private sector, the voluntary sector and individuals have shifted dramatically from being
collected and held in analogue (e.g. paper) form to being stored digitally. In 1986 around 99% of all
information was stored in analogue form; by 2007 some 94% was being stored in digital form[3].
Moreover the total amount of data has grown exponentially: it has been estimated that more data was harvested between 2010 and 2012 than in all of preceding human history[4]. This increase owes much to new technologies of data collection (notably video recordings, plus sensors built into satellites, retail outlets, refrigerators, etc.) and of processing and dissemination. Included are the results of many
surveys reported daily in the media, some of which give minimal details of how the (often
questionable) results were obtained. Much of this informs decisions and shapes attitudes and opinions.
What of it can we trust and how do we know?
[1] This paper formed the basis of the University of Bath Founder’s Day Lecture, given by the author on 2 April 2014. Neither the University of Bath nor the Government’s Advisory Panel for Public Sector Information (APPSI) bears any responsibility for the views expressed.
[2] UN Food and Agriculture Organisation estimate for 2010-2012.
[3] Hilbert M and Lopez P (2011) The world’s technological capacity to store, communicate, and compute information. Science 332 (6025), 60-65.
[4] This claim is normally attributed to IBM but see http://www.bbc.co.uk/news/business-17682304. Much of the data is user-collected (e.g. videos and photographs, emails, and social media traffic including tweets).
2. The nature of data and information
In what follows it is helpful to know that data and information have particular characteristics. These
include:
- Conventionally we use a terminology based on a hierarchy of value rising from data to information to evidence to knowledge to wisdom. Data are numbers, text, symbols, etc. that are in some sense value-free, e.g. a temperature measurement. Information is formally differentiated from data by implying some degree of selection, organisation and relevance to a particular use; i.e. multiple sets of information may be derived by selections from and analysis of a single data set. In practice, however, the distinction between data and information is often blurred in common usage, e.g. ‘Open Data’ and ‘Public Sector Information’ as used in government publications frequently overlap or are used synonymously.
- Even though the initial cost of collecting, quality assuring and documenting data may be very high[5], the marginal cost of providing an additional digital unit is close to zero.
- Use by one individual does not reduce availability to others (termed non-rivalry).
- In some cases individuals cannot be excluded from using the good or service (termed non-excludability).
- Linkage of data can generate many more possible uses. Thus if 20 different sets of information are collected for all of the Local Authorities (LAs) in Britain, there will be 190 pairs of information and over 1 million combinations in total for each LA, some of which may reveal new knowledge (the arithmetic is checked in the sketch after this list).
- Unlike fuel, data and information are not consumed by use. They may get out of date (like census information) but rise again in value in comparisons with new information.
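The combinatorics in the linkage bullet above are easy to verify. A minimal check in Python, assuming nothing beyond the stated n = 20 information sets:

```python
from math import comb

n = 20  # distinct information sets held for each Local Authority

pairs = comb(n, 2)   # unordered pairs of information sets
subsets = 2**n - 1   # every non-empty combination of the 20 sets

print(pairs)    # 190
print(subsets)  # 1048575, i.e. "over 1 million combinations"
```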
3. Exploiting the data mountain
Mountains of data are of little value unless they can be transformed into information and used, often
for a myriad of initially unforeseen purposes. The advent of the internet and then the world wide web
in the 1980s and ‘90s has transformed human experience and business practice: it is now routine to
find information lodged in distant lands, often in a few seconds, and share it with others or analyse it,
frequently in conjunction with other information. Moreover it is now also routine to go beyond
analysis of the ‘what is’ situation. ‘What if’ analyses and modelled projections and forecasts are
becoming commonplace – such as the prediction by retailers like Amazon of what you are likely to
buy - and even the locations of future crimes.
It is convenient to view information for present purposes as sub-divided into three parts based on their
custodianship: data created as part of the National Information Infrastructure (NII) and often made
available as Open Data; personal data held by the state; and personal and other data held by
commercial bodies. In practice, the categories are not distinct but lie on a spectrum.
3.1 Non-personal data as national infrastructure and Open Data
The importance of maintaining a sound national physical infrastructure - roads, railways, utilities and
so on - is well-recognized. Thus the interstate highways of the USA transformed commerce,
employment and recreational travel from the 1950s on. In times of recession, many governments
across the world see suitable 'shovel ready' infrastructure projects as a means of stimulating the
economy and generating long term efficiencies. In essence, this amounts to building and exploiting a National Infrastructure[6]. What we are seeing now is the emergence of a new National Information Infrastructure[7]. This is not coherent – like Topsy, it has simply ‘growed’ – in this case from legacy information needed by the state to carry out its roles or to generate revenue. There is a good case – which UK government seems to have accepted – for now regarding this more strategically by thinking about what information is really needed for the future and how technology could help provide it.

[5] The 2011 UK Census of Population data cost approximately £480m to plan, collect and process (one of the lowest per capita costs in the world for such surveys) but data dissemination costs were minimal.
Until recently, governments have been the largest ‘owners’ of data about their people, their
economies, and physical phenomena like the shape of the land surface, the weather and climate and
the geography of transport networks. Some government actions in the information domain have been
of immense public benefit far outside their national boundaries. The obvious example is the creation
of the Global Positioning System (GPS) by the US government. It has underpinned a whole new global location services industry, valued in 2013 by the economic consultancy Oxera at between $150 and $270 billion per annum, and has prompted European and Russian competitor systems.
For governments, making the non-personal information they hold easily and publicly accessible has
the potential to provide four positive outcomes. These are enhanced transparency and government
accountability to the electorate, improved public services, better decision-making based on sound
evidence, and enhanced competitiveness for the country in the global information business.
All this is part of the Open Data agenda embraced enthusiastically by the current government and its predecessor. Some 13,000 government data sets are now accessible via data.gov.uk[8] and can be used for any purpose – personal, communal or commercial - under the Open Government Licence[9]. The same agenda has been picked up by some local authorities and many other public bodies. It has also become an international growth industry, with the Open Data Charter[10] signed by the leaders of Canada, France, Germany, Italy, Japan, Russia, the UK and the USA at the G8 meeting in June 2013. This commits those nations to a set of principles (‘Open by default’, ‘Usable by all’, etc), to a set of best practices (including metadata provision, i.e. data describing the main data set classifications, etc) and to national action plans, with progress to be reported publicly and annually. Some 60 countries have also come together in the Open Government Partnership - underpinned by Open Data.
3.2 Personal data and public benefit
Much government data is collected about individuals and used for administrative purposes or for
research which helps shape policy. Subsequently such data may be aggregated to produce
information about geographical areas or sub-groups of the population – and increasingly this is then
made available as Open Data. Following a report by Thomas and Walport[11], the UK government has recognised the utility of using such personal data[12] for purposes other than those for which they were
originally intended, notably to reduce costs and provide more up to date statistical and other results
(Section 4.1). As a result, they have funded through the Economic and Social Research Council
(ESRC) a set of Administrative Data Research Centres (ADRCs) in each country of the UK and an
ADR Network (ADRN). These are designed to enable carefully controlled access to a variety of
detailed administrative data for academic research or government policy purposes. The ESRC-funded bodies are accountable to Parliament through the UK Statistics Authority.

[6] https://www.gov.uk/government/collections/national-infrastructure-plan
[7] Promulgated by the Advisory Panel on Public Sector Information (APPSI) since 2010, this has now been widely adopted as a concept, e.g. https://www.gov.uk/government/publications/national-information-infrastructure
[8] Though a substantial fraction were previously available from a multiplicity of government web sites.
[9] Measuring the real success of the initiative is difficult, as the Public Administration Select Committee has pointed out: http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/564/564.pdf
[10] https://www.gov.uk/government/publications/open-data-charter/g8-open-data-charter-and-technical-annex
[11] http://webarchive.nationalarchives.gov.uk/+/http:/www.justice.gov.uk/docs/data-sharing-review.pdf
[12] Personal data as described by the Data Protection Act of 1998 is data that relates to living individuals who are or can be identified from the data. Organisations that want or need to share, disseminate or publish their data for secondary use are obliged under the DPA (1998), unless exempt, to transform the data in such a way as to render it anonymous and therefore no longer personal (see http://ukanon.net/key-information/).
Carefully controlled access within ‘safe havens’ to linked personal data can produce major public
benefits. Many of the benefits are not available through analysis of aggregate data, such as those for
large population (e.g. ethnic) groups or for populations in geographical areas. Potentially beneficial
uses of such individualized data include:

- Reducing the burden on people or businesses to fill in multiple government questionnaires, with data collected for one purpose being spun off from databases originally collected for another.
- Ensuring that wherever an individual travels, his or her detailed health records could be available to any doctor in the event of a sudden illness or accident.
- Tracking the life histories and geographies of individuals to study correlations between, for example, exposure to environmental hazards and later illnesses.
- Knowing where people are in real time and hence the location of potential victims in the event of natural disasters. Knowing the movements of crowds in real time from mobile phones can greatly help police to manage crowds safely.
- Studying inequality in society through analysis of life experiences by people in small, specifically defined groups (e.g. population cohorts, local communities or ethnic groups), from which new policies and actions might flow.
- Reducing the incidence of fraud by merging multiple administrative datasets (e.g. tax records and social security benefits) and seeking anomalies (a toy illustration follows this list).
- Profiling people on the basis of their background, personal characteristics, DNA or contacts for setting insurance premiums, marketing or credit-rating purposes – or, very controversially, to detect those predisposed toward acts of crime or terrorism.
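As a toy illustration of the fraud-detection bullet above: merge two hypothetical administrative record sets on a shared identifier and flag inconsistencies. All identifiers, fields and the threshold below are invented for illustration; no real system is described.

```python
# Invented records: tax data and means-tested benefit claims keyed on one ID.
tax_records = {
    "A1": {"declared_income": 12_000},
    "A2": {"declared_income": 85_000},
}
benefit_claims = {
    "A2": {"benefit": "income_support"},
    "A3": {"benefit": "income_support"},
}

INCOME_CEILING = 16_000  # illustrative eligibility threshold

# Anomaly: a means-tested claim alongside declared income above the ceiling.
anomalies = [
    person_id
    for person_id in benefit_claims
    if person_id in tax_records
    and tax_records[person_id]["declared_income"] > INCOME_CEILING
]
print(anomalies)  # ['A2'] -- a candidate for human review, not proof of fraud
```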
The potential benefits of linkage and sharing of personal information are most obvious in health but exist in many other domains. As indicated above, joining together information held in hospitals and that held by GPs about the same person, as proposed under the NHS Care.data plans, can enable more ‘joined up’ care and also facilitate research very likely to lead to long term public benefits. In part this is because the UK has a strong competitive advantage through the national coverage of the NHS and the resulting large volumes of health data, which are more comparable than elsewhere. Yet the six month delay in implementation of Care.data, triggered by antipathy from GPs and other parties[13] - despite such data linkage operating in Scotland - demonstrates the need for extraordinary efforts to secure public trust before launching schemes based on use of linked data about each and every person[14].
3.3 Data as a commercial asset
In some respects the government primacy in data holdings has changed dramatically. Commercial
enterprises now routinely collect and exploit vast amounts of information about their individual
customers, notably through loyalty cards or through customer agreements (such as for loans).
Profiling of their individual customers helps target sales offers and increase favourable responses by a
factor of ten in many cases. Google Maps is now embedded in around 800,000 corporate web sites in Britain and much private and public sector data is linked to its geography. Substantial amounts of data are also collected through crowd-sourcing – such as the world-wide Open Street Map, which has over a million users and tens of thousands of volunteer contributors.

[13] See, for example, http://www.telegraph.co.uk/health/healthnews/10656893/Hospital-records-of-all-NHS-patients-sold-to-insurers.html, http://www.theguardian.com/society/2014/feb/21/nhs-plan-share-medical-data-save-lives and http://www.theguardian.com/commentisfree/2014/feb/28/care-data-is-in-chaos
[14] http://www.hscic.gov.uk/media/13726/Kingsley-Manning-speechHC2014/pdf/201314_HSC2014_FINAL.pdf
One consequence of all this is the commoditization of some data. What was once a resource highly valuable because of limited access or restrictive practices followed by governments is now increasingly a low cost commodity. This – and increasing competition in data supply, some of it global – has big implications for organizations trading in data. To survive they must add value, often by providing expert services using the data and ‘co-mingling’ it to provide a unique offering. A good example of this is the Climate Corporation[15]. This is a US start-up founded in 2006 by two former Google data scientists. It has combined 30 years of weather data, 60 years of crop yield data, and 14 terabytes of soil data – all free from US government agencies. The services offered include yield forecasting, to help farmers make decisions such as where and when to plant crops in order to boost productivity, plus weather and crop insurance to help manage risk. The 'precision agriculture' firm was acquired by Monsanto, the world's largest seed company, for $930m in October 2013. The success of Climate Corporation seems to owe much to ‘first mover’ advantage rather than guarding intellectual property rights – a contrast to many other information traders (including some government bodies).
3.4 The new information ecosystem
In summary, there has been the collection of unprecedented volumes of accessible data, the
emergence of massively more powerful analytical tools to convert data into information and the
means to transfer the latter near-instantaneously. This is often encapsulated in the term ‘Big Data’.
Technologically-driven enthusiasm has been misplaced before:
In 1858 a Victorian said: “It is impossible that old prejudices and hostilities should longer exist,
while such an instrument has been created for the exchange of thought between all the nations
of the Earth[16]”. He referred to the laying of the first submarine trans-Atlantic telegraph cable.
Yet if we ignore the ‘hype’ frequently associated with the Big Data term, these developments taken
together constitute a new information ecosystem. This provides an opportunity to enhance how we
operate for the public good and also to build commercial benefit.
There is increasing acceptance in the UK and beyond that Open Data – comprising non-personal data
- is ‘a good thing’[17] and hence to be welcomed. But the attitude to personal data about individuals is
another matter, especially that held by government. The purposes for which such personal data are
proposed to be used, the nature of the engagement with the affected individuals and the consent
mechanisms in place can materially affect public trust in the data custodians and the whole of
government. I now address the main problems in securing the benefits of the new technologies and
data.
4. But what about the problems?
The developments described above have already raised the following serious issues:

- Real or perceived threats to individual privacy, sometimes through pursuit of a wider public good.
- The appropriate role of the state.
- Consequences of technology failure.
- The relative dearth of quantitative skills in the population and its impact on the creation and use of evidence.
- The dangers posed by a combination of a data-savvy elite and gross inadvertent or other misrepresentations of scientific findings in the media or by politicians.

[15] David Kesmodal, “Monsanto to buy Climate Corp. for $930 million”, Wall Street Journal, October 2, 2013.
[16] Cited in The Economist 17 August 2000.
[17] But see the PASC report: http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/564/564.pdf
All this leads inexorably to the fundamental question: how does the lay public know who and what information to trust? I now consider each of the issues in turn before returning to this question.
4.1 Threats to individual privacy, sometimes through pursuit of a wider public good
Our personal right to privacy has in practice been sharply undermined since the attack on the Twin Towers in New York in 2001. This event, together with the new technologies described above, has led to a global increase in surveillance, capturing the movements, characteristics and conversations of
people on a massive scale (as Snowden revealed). Understandably this has given rise to serious
concerns about privacy infringements, the role of the state and the trade-off between privacy and
public benefit.
We should distinguish between the different bases on which public and private sector bodies collect
and hold personal information. That collected by government bodies is very often mandated by law:
citizens have no right to refuse but benefit from government using the data in planning facilities or
supplying social services (as well as collecting income-based taxes). In contrast, individuals
voluntarily - and often happily - provide personal information to many private sector bodies (e.g.
retailers) for gain such as discounts on purchases. There is some opposition to the consent mechanism
which normally consists of ticking a single box to sign up to lengthy contracts containing much small
print. Yet this is entirely legal.
Attitudes to privacy may be context-dependent. For instance, parents concerned about the safety of
their children might well contemplate giving them a smartphone with a location-reporting app. An
injured hill walker or a wounded soldier would be happy for the rescue services to know their
location. In these latter examples, an element of privacy is being traded off to ensure enhanced safety
at some moment in time.
We should also note that what constitutes privacy and what information is considered appropriate for
the state to hold and use in combination varies greatly in different cultures. Nordic countries operate
a series of linked registers to provide large amounts of administrative data about individuals for
multiple government purposes. In general, three base registers exist containing details of each ‘data
unit’. These are a population register, a business register (enterprises and establishments) and a
register of addresses, buildings and dwellings. A unique identifier for each ‘data unit’ in the base
registers, and links between them, is used. The unique identifier is also used in many other
administrative registers for these ‘data units’ such as educational registers and taxation registers. An
extract from the population register serves as a basic document that is needed when applying for a
passport, getting married or divorced, or when a funeral is held or an estate is distributed.
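The register architecture just described can be sketched schematically. The records and fields below are invented, and real Nordic registers are far richer; the essential point is simply one unique identifier linking the base registers.

```python
from dataclasses import dataclass

# Schematic of the Nordic model: base registers keyed by a shared unique
# identifier. All fields and records are invented for illustration.

@dataclass
class Person:
    person_id: str     # unique identifier reused across registers
    name: str
    dwelling_id: str   # link into the address/building/dwelling register

@dataclass
class Dwelling:
    dwelling_id: str
    address: str

population = {"P001": Person("P001", "A. Andersen", "D042")}
dwellings = {"D042": Dwelling("D042", "1 Example Street")}

# Linking registers reduces to a lookup on the shared identifier:
p = population["P001"]
print(p.name, "lives at", dwellings[p.dwelling_id].address)
```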
The privacy issue is essentially about whether data concerning an identifiable individual - held by
governments or private sector bodies - are improperly disclosed. The extreme approach to
safeguarding such personal data is that proposed in 2013 by the European Parliament as the basis of a
revised EU Data Protection Directive: that on every occasion where such a data set is to be used for a
new purpose, permission for each record to be re-used should be obtained from the individual
concerned. This would devastate research and societal benefit in the health domain and in many
others.
Central to any proper use of personal data by third parties is adherence to relevant laws or protocols
plus application of sound methods which anonymize the records. The UK Anonymisation Network
publishes best practice in anonymisation and offers practical advice and information to anyone who handles personal data and needs to share it[18]. It is widely recognised that the creation of a single
unique identifier for each citizen, applied consistently across all datasets from the outset, has the
potential to allow unauthorised re-identification and could compromise individuals' privacy. An
approach has therefore been developed based on attaching unique identifiers to data shares on a case
by case (or project by project) basis, where the identification algorithms and identifiers are destroyed
once matching and linking have taken place. This gives assurance within each matched dataset that
the data relates to the same data subject, but significantly reduces the risk of subsequently matching
the data to other sources without the authority of the data controller. The approach provides strong
controls on how data are used, and provides transparency for the citizen so that they can hold data
controllers and statisticians to account.
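One common way to realise such case-by-case identifiers is a keyed hash (e.g. HMAC) with a project-specific secret that is destroyed once linkage is complete. The sketch below shows that general technique only; it is not the algorithm of any particular UK body, and the identifiers are invented.

```python
import hashlib
import hmac
import secrets

def project_pseudonym(persistent_id: str, project_key: bytes) -> str:
    """Derive a project-specific pseudonym from a persistent identifier."""
    return hmac.new(project_key, persistent_id.encode(), hashlib.sha256).hexdigest()

# A fresh key per linkage project: within the project the same person always
# maps to the same pseudonym, but pseudonyms cannot be matched across projects.
key = secrets.token_bytes(32)

tax_ids = {"AB123456C", "ZY987654A"}
health_ids = {"ZY987654A"}

tax_pseudo = {project_pseudonym(i, key) for i in tax_ids}
health_pseudo = {project_pseudonym(i, key) for i in health_ids}
print(len(tax_pseudo & health_pseudo))  # 1 record links across the two sets

del key  # destroying the key removes the ability to re-link or re-identify
```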
There are well-proven examples where successful and safe use has been made of data relating to
individuals using such safeguards. For instance the Scottish Longitudinal Study[19] (SLS) is a large-scale linkage study created using data from multiple administrative and statistical sources and
benefiting from the integrated Scottish governmental structure. The data include detailed population
census data from 1991 onwards; vital events data (births, deaths, marriages); the NHS Central
Register data (which gives information on migration into or out of Scotland); and education data
(including Schools Census and other data). The SLS is a sample of 274,000 individuals from the
Scottish population using 20 random birthdates. All data are anonymized and strict procedures for use
of the data are in place, with each proposed analysis requiring approval by ethics committees and
oversight by senior individuals i.e. there is a robust form of governance in place. The result is a series
of valuable studies which could not otherwise have been carried out and which have led to important
changes in policy and practice: the rates of blindness and amputation amongst diabetes sufferers in
Scotland have, for instance, been dramatically reduced in part by early identification of those at risk
through data linkage. It is also relevant that no cases are known of any leakage of personal data from
censuses in the last century (the entirety of the personal data for every individual is released under
statute 100 years after it is collected).
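The SLS sampling rule, membership determined by whether a person's birthday falls on one of 20 chosen dates, can be pictured as follows. Dates and records are invented and the real procedure is more involved.

```python
import datetime

# Invented illustration of birthdate-based sampling: 20 (month, day) dates
# capture roughly 20/365, about 5.5%, of the population, which for Scotland
# gives a sample of the order of 274,000 people.
sample_dates = {(1, 13), (3, 2), (7, 29)}  # an illustrative subset of the 20

people = [
    ("P1", datetime.date(1960, 1, 13)),
    ("P2", datetime.date(1985, 6, 30)),
]

in_sample = [pid for pid, dob in people if (dob.month, dob.day) in sample_dates]
print(in_sample)  # ['P1']
```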
The fundamental question is whether, despite all the techniques cited above, any anonymized personal data can be de-anonymized. As always, there are some trade-offs: accepting some risk can improve the benefits. A major contribution to anonymization is to remove all indicators of geography, yet this usually severely limits the value of the data. In addition, and by using fuzzy logic techniques, it may in some cases be possible to identify some individuals by bringing together two previously anonymized data sets (a toy illustration follows). Experts have concluded that it is probably impossible to guarantee that anonymization can always work for all time by technical means alone. However, in well designed and operated systems with good governance structures, access through ‘safe havens’, and criminal sanctions against misuse (see section 5), such failures should be very rare indeed whilst the potential social benefits could be very large.
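The re-identification risk can be made concrete with a toy example: two separately 'anonymised' releases joined on shared quasi-identifiers. Every record below is invented; the point is only that a rare combination of attributes can act as a fingerprint.

```python
# Two 'anonymised' releases sharing quasi-identifiers (postcode district
# and year of birth). All records are invented.
hospital_release = [
    {"postcode": "BA2", "birth_year": 1971, "diagnosis": "diabetes"},
]
survey_release = [
    {"postcode": "BA2", "birth_year": 1971, "respondent": "J. Doe"},
]

matches = [
    (h, s)
    for h in hospital_release
    for s in survey_release
    if (h["postcode"], h["birth_year"]) == (s["postcode"], s["birth_year"])
]
# If that attribute combination is rare, the join re-identifies the person:
print(matches)
```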
There is some evidence that members of the public understand the trade-offs and support data linkage[20]. It is worth noting that, once ‘captured’ in a governmental system (and in some commercial ones), it is difficult to escape. Moreover in many cases there will always be a record that you once existed on the system. There is no universal right to choose to be forgotten!
4.2 The appropriate role of the state
As indicated earlier, the nation state (and local governments) remain central to creating data and
extracting much utility from derived information. But the role of the state matters beyond the handling of personal data. Here my focus is on commercial exploitation of information assets held by government.

[18] http://ukanon.net/
[19] http://sls.lscs.ac.uk/about/
[20] http://www.scotland.gov.uk/Publications/2012/08/9455/0 and http://www.ipsos-mori.com/DownloadPublication/1652_sri-dialogue-on-data-2014.pdf
Most government bodies collect data only to meet their ‘public task’. What they make available
thereafter can be considered ‘exhaust data’, the great bulk of which is made available free or at
marginal cost (usually near zero). However the UK operates a ‘mixed data economy’: there is a group
of public bodies which are charged with covering their costs through commercialisation of data or
services based upon the data they collect and hold. Of the 20 or so Trading Funds, Companies House,
the Met Office, Land Registry and Ordnance Survey all generate very substantial proportions of their
costs by such means (though all have been persuaded to make some of their information available as
Open Data). Other bodies, such as the Environment Agency, make charges for commercial use of
their (e.g. flood) data. The most controversial example of the commercial/Open Data tension is Royal Mail, which created detailed data about all addresses across the UK whilst a publicly owned
corporation. Royal Mail claimed that its privatisation was at risk if it did not retain ownership of and
exploitation rights to this data and politicians acquiesced - despite many arguments to the effect that
this was a critical core data set, part of the National Information Infrastructure and should be Open
Data. Parallel considerations are (at the time of writing) under consultation in regard to the possible
privatisation of Land Registry[21]. The Public Administration Select Committee has strongly criticised such sales of information assets[22], arguing that key information assets should be retained in public
ownership. Leasing them out may be a much more appropriate approach.
But public ownership alone does not obviate other problems. For example there is a real danger of
quasi-monopolies enjoyed by government bodies exerting their market position to squeeze out
competition from the private sector through an array of techniques. This might very well lead – at the
very least - to a reduction in public trust in such bodies but would also potentially damage the growth
of competitors and thereby reduce benefits to the public. Robust regulation of Trading Fund activities,
mainly by the Office of Public Sector Information and the Office of Fair Trading, is therefore of
critical importance.
A further source of potential problems is the creation of commercial monopolies through outsourcing
of tasks carried out previously within government. A number of cases have occurred where data
collected to carry out the task was claimed to be owned by the contractor and hence not capable of
being made available as Open Data. This has been recognised and government is increasingly
employing procurement contracts which obviate the issue. Indeed there are some signs of very
positive approaches by a few commercial bodies in working with public bodies, notably by Elgin[23], which provides a national service about the location and other details of roadworks assembled from disparate local data sets.
4.3 Causes and consequences of technology failure
We take the reliability of electronic devices, especially consumer items like smartphones and
embedded Global Positioning Systems (GPS), to be a given. Yet that is not necessarily valid. Modern
technological societies are characterized by a complex interweave of dependencies and
interdependencies among their critical infrastructures. Any significant failure, especially of electricity
supplies, can have cascading consequences.
Perhaps the biggest risks are in regard to civilian use of GPS. These arise from both human and
natural hazards - and it matters. The European Commission estimated in 2011 that 6-7% of GDP in
Western countries, i.e. € 800 billion in the European Union, is dependent on satellite radio navigation.
[21] https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/274493/bis-14-510-introduction-of-a-land-registry-service-delivery-company-consultation.pdf
[22] http://www.parliament.uk/business/committees/committees-a-z/commons-select/public-administration-select-committee/news/open-data-substantive/
[23] http://www.elgin.org.uk/ and http://roadworks.org/
Use of GPS for car guidance and maritime navigation (especially in busy areas like the English
Channel) is ubiquitous; sudden loss or modification of the signals (e.g. by criminals) could be
catastrophic. Moreover a Royal Academy of Engineering report[24] pointed out that the time signals from the GPS system in particular have found application in managing data networks and mobile telephony. Civilian GPS signals may be blocked over small areas by devices readily purchased on the internet. North Korea is believed to use Russian devices to jam GPS signals over much wider areas of South Korea. The UK government and others are seeking to build greater resilience into these systems: that should be aided by the European Galileo navigation system but the threat remains very serious[25].
In addition, the UK national civilian risk register[26] identifies space weather as a major threat. A
variety of technologies are susceptible to the extremes of space weather - severe disturbances of the
upper atmosphere and of the near-Earth space environment that are driven by the magnetic activity of
the Sun. The main industries whose operations can be adversely affected by extreme space weather
are the electric power, spacecraft, aircraft radio communications, and (again) GPS-based positioning
industries.
The greatest magnetic storm in historical times was the Carrington event of August-September 1859, when auroral displays were as bright as daylight over an area covering North and South America, Europe, Asia, and Australia. Magnetic observatories recorded disturbances in Earth’s field so extreme that magnetometer traces were driven off scale, and telegraph networks around the world - the “Victorian Internet” - experienced major disruptions and outages.
Solar activity has been relatively low in the last 50 years but there is some evidence that such activity
is growing. Consequential events are low frequency but, in extreme cases, may have very high
impacts. Needless to say, much work is underway to publicise[27] and mitigate the risks.
4.4 The relative dearth of quantitative skills in the population and its impact on evidence
Effective and safe analysis of vast amounts of data requires quantitative training and sometimes
advanced statistical skills. Equally, building new applications to enable such analysis and create new
industries and services requires quantitative skills. In so far as mathematics education is a good
indicator of such abilities, the UK is bottom of a league table of 24 OECD countries in the proportion
of 16 year olds who continue with any maths education[28]; the rapid development of Tech City in
London has been facilitated by many immigrants trained outside of the UK.
It is now widely accepted that improved quantitative training is required in many subject domains, not
just in mathematics education. The government has created a data capability strategy, an important
part of which focuses on human capital development[29]. Various universities are creating Masters
degree courses for data scientists. The situation is particularly acute at undergraduate level and in
those trained in the social sciences: work by the Economic and Social Research Council (ESRC) –
and endorsed by the British Academy[30] - has shown that qualitative UK social science in universities
is of very high quality but that much needs to be done outside Economics in quantitative training. In
response to this, the Nuffield Foundation, ESRC and the Higher Education Funding Council have set
up the £19.5m Q-Step initiative[31] which is funding at least 15 new university centres to bring enhanced and innovative training into social science undergraduate courses and build links with schools. Meantime, the Royal Statistical Society’s (RSS) GETSTATS programme has sought to broaden understanding within parliamentary, school and public audiences of the importance of acquiring such skills.

[24] http://www.raeng.org.uk/news/publications/list/reports/RAoE_Global_Navigation_Systems_Report.pdf
[25] http://www.ft.com/cms/s/0/fadf1714-940d-11e3-bf0c-00144feab7de.html?siteedition=uk#axzz2tICPGjzG
[26] http://www.foundation.org.uk/Events/AudioPdf.aspx?s=1278&f=20130206_Beddington.pdf
[27] National Academies Press (2008) Severe Space Weather Events: Understanding Societal and Economic Impacts. A Workshop Report; and Lloyds of London (2012) Space weather: its impact on Earth and implications for business.
[28] http://www.nuffieldfoundation.org/uk-outlier-upper-secondary-maths-education
[29] https://www.gov.uk/government/publications/uk-data-capability-strategy
[30] http://www.britac.ac.uk/policy/Society_Counts.cfm
[31] http://www.nuffieldfoundation.org/q-step
The dearth of appropriate skills impacts on our creation of evidence through data analysis and
application of the findings. A common failure is the consistent use of point estimates to justify major
procurement decisions, instead of ranges. One example is claiming that a PFI deal is good value for
money because its NPV is £xm lower, over 30 years, than the NPV of a hypothetical public sector
alternative - rather than explaining that cost projections over this period are uncertain and vulnerable
to shifts in assumptions, and therefore require the presentation of ranges and sensitivities. A prime
demonstration (of many) is the case of the re-franchising of the West Coast Mainline railway, as dissected by the National Audit Office[32]. The re-franchising was a major endeavour, with considerable complexity and uncertainty and a range of overlapping issues. In addition to managing the complex programme poorly, the Department for Transport relied heavily on technical analysis and modelling. Unfortunately there was a significant error in the tool that it used to calculate the subordinated loan facility. The result of all this was a catastrophe, with the Department cancelling the provisional (but by then public) letting of the contract to one bidder and a substantial loss to the taxpayer.
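The point-estimate failure above is easy to demonstrate. With invented cash flows, a single NPV comparison hides how much the apparent saving moves as the discount rate varies; a range, not a point, is what the decision-maker needs.

```python
# Invented figures: only the sensitivity of the gap matters, not the numbers.
def npv(cashflows, rate):
    """Net present value of yearly cash flows at a given discount rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows, start=1))

pfi_deal = [12.0] * 30           # GBP 12m per year for 30 years
public_option = [14.0] * 30      # hypothetical public sector comparator

for rate in (0.03, 0.05, 0.07):  # test a range of assumptions, not one point
    gap = npv(public_option, rate) - npv(pfi_deal, rate)
    print(f"discount rate {rate:.0%}: deal cheaper by GBP {gap:.1f}m")
```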
4.5 Dangers of a data-savvy elite combined with gross misrepresentations of scientific findings by the
media or politicians
Implicit in all of the above is a view that running the state or parts of it is best done when underpinned by evidence based on the best available information, quantitative whenever possible - and
where the data and information behind the evidence is publicly accessible. This is the essence of Open
Science as promulgated by the Royal Society[33]. If both information and the methods by which analyses of them have been carried out are publicly available, then others can attempt to replicate the conclusions. Evidence-Based Policy founded on such open approaches is sometimes not easy to operate – external factors (like time pressures) may require decisions made on the best available previous analyses and less-than-ideal information. But the principle remains as a gold standard. In the
private sector (e.g. in identifying good sites for supermarkets) commercial confidentiality is often a
serious constraint to open operations.
Perhaps surprisingly, there may be a downside to the case made earlier for multiplying the numbers of
data scientists. Whilst public trust in scientists has held up well[34] (except perhaps for those involved
in the nuclear industry), the advent of a small group of experts setting the agenda based on analyses
that only they understand is highly dangerous (as in the case of banks before the financial crisis in
2008). Put simply, this argues for still greater efforts to exchange information, views and concerns by
public debate between the lay public and specialists – rather than simply trying to gain public
acceptance by circulated documentation written by experts.
Nowhere is the issue of small numbers of experts advising decision-makers or key influencers more
important than in the House of Commons (cf. the House of Lords) and the media. The House of
Commons (and, by extension, Ministers) contains very few trained scientists and quantitatively
trained individuals. The same dearth exists amongst journalists, who are drawn mostly from arts
backgrounds yet play a crucial role in informing the public. It is not surprising therefore that
misinterpretations occur and scathing comments are sometimes made about those data analysts
bearing bad news. One common manifestation of this is the focus on the most recent figures, such as
measures of economic growth (e.g. as for GDP) or inflation (e.g. CPI). Thus we see headlines about a
short term change of 0.1% in GDP (within the margin of measurement error) whilst the longer term
trends are ignored. That said, some of the blame must lie with data producers and publishers; many
cases occur of publications being written for technical elites, with little interpretation of the presented data. The UK Statistics Authority has been running a campaign to enhance explanation in statistical publications; an ONS Good Practice Team has been supporting government departments to this end. And the RSS has run many courses for journalists on the safe use of statistics.

[32] http://www.nao.org.uk/wp-content/uploads/2012/12/1213796es.pdf
[33] https://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf
[34] http://www.ipsos-mori.com/Assets/Docs/Polls/pas-2014-main-report.pdf
Put very simply, sound policy-making needs as good an information base as we can assemble. But it
also requires openness, clarity about the limitations of the evidence, reproducibility and even some
humility on the part of the evidence providers; it equally requires preparedness to listen and critique
plus receptiveness on the part of the policy makers or other decision-makers. People matter
immensely even in a world of dazzling technology and vast data.
5. Information governance, regulation and the contribution of civil society
A necessity in addressing many of the public concerns about use and misuse of data or information is to have sound and transparent governance arrangements in place, and for them to be widely known and tested. Good information governance applies to all handling of information but particularly to personal information. An important early milestone was the 1997 Caldicott Report on information governance in the NHS. This established six principles which are also relevant to many situations outside of the health service:

- Justify the purpose(s). Every single proposed use or transfer of patient-identifiable information within or from an organisation should be clearly defined and scrutinised, with continuing uses regularly reviewed, by an appropriate guardian.
- Don't use patient-identifiable information unless it is necessary.
- Use the minimum necessary patient-identifiable information.
- Access to patient-identifiable information should be on a strict need-to-know basis.
- Everyone with access to patient-identifiable information should be aware of their responsibilities.
- Understand and comply with the law.

Subsequently Dame Fiona Caldicott led a second review[35] which reported in 2013. This added a seventh principle:

- The duty to share information can be as important as the duty to protect patient confidentiality.
Beyond all that, good governance must include:

- Sound processes for vetting requests for access to personal data and for assessing the likely public benefits and the risks to privacy.
- Access to the data through ‘safe havens’ which have strong security controls.
- Safeguards on physical data storage based on international safety standards.
- Undertakings from the users about ‘identifiability’, with effective legal sanctions should these be broken (e.g. gaol sentences).
Governance standards must be published and readily accessible. Those responsible for their oversight
and implementation must be specified and complaint and appeal mechanisms published.
[35] https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/192572/2900774_InfoGovernance_accv2.pdf
Beyond these hugely important governance arrangements within organisations handling information,
appropriate independent regulation arrangements need to be in place. For all official statistics this is
provided under statute by the UK Statistics Authority. The Office of Public Sector Information plays an important role in regard to the use of public sector information (PSI), not least in supervising adherence to UK legislation arising from relevant EU Directives. An array of other regulators (e.g. within the NHS, the Office of the Information Commissioner and the Office of Fair Trading) play important roles, though the complexity of the information-related regulation landscape makes for many difficulties.
In addition to these parts of the ‘official infrastructure’ of regulation there are some civil society
organisations which analyse and comment on public statements in the media and by politicians. A
good example is Full Fact[36]. A plethora of comment is also produced daily via blogs. Some of this – such as that by Ben Goldacre on health matters – is invaluable. In totality these provide a valuable correction mechanism for erroneous or misleading information.
6. Conclusions
The most fundamental question I set out to address is how the lay public knows what information to trust and whom to trust. All of the above is designed to underpin the answer to this question, though my answer will not guarantee success!
In an ideal world, the lay public would have much greater understanding of quantitative approaches
including uncertainty and risk. Uncertainty exists in various parts of any analysis or study, whether in the measurement process, in the consequences of inbuilt assumptions or in the selection of particular models. Achieving this understanding is a long term aim. What follows deals more with the short term.
For the data scientist, sensitivity testing of all assumptions and the models built on them is essential and must be considered in assessing and reporting results. Identification of hazards and risks, plus mitigation of the latter, is central to any application of evidence. Confirmation bias – more readily accepting results which support prior intuition – must be guarded against.
The non-scientist manager will wish to ensure that all these steps have been taken. But other ‘sense
tests’ may help to guide the non-expert. These include:
- Be sceptical but be willing to be convinced.
- Triangulate information from multiple sources to assess consistency and understand reasons for any differences. Thus the varied but broadly consistent responses from multiple central government departments to the January 2014 public consultation on the importance and uses of Census information, alongside that from the Local Government Association, were very convincing.
- Rely on organisations that others trust and whose public statements are subject to scrutiny by other experts (such as the Institute for Fiscal Studies and Full Fact) – preferably those that use Plain English and have a long and good track record.
- Place greater reliance on data and information from long-established, successful organisations which have much to lose if their reputation is sullied (but beware of monopolies).
- Other things being equal, place reliance upon bodies which are relatively independent, have their own endowments and operate at arm’s length from governments or other funders.
- Require that any individual or organisation which produces evidence can explain the sources and characteristics of the data or information used, how it was derived, the quality assurance processes applied and the main sources of uncertainty (i.e. the metadata).
- Check that appropriate information governance arrangements, suitably comprehensive and publicly accessible, are in place. Is the organisation making questionable statements regulated, and is there redress available for any failures?

[36] https://fullfact.org/
As indicated in Section 5, anyone holding management responsibility within an information-producing and -consuming organisation (i.e. all of them) must ensure not only that good governance processes operate. They must also ensure that horizon scanning for new developments and risks
relevant to their organisation is carried out by suitably tasked staff.
In a world of uncertainty and ‘hype’ arising from revolutionary technological developments,
scepticism is a sound approach. But these developments affect all of us whether we wish it or not. It
would be singularly unfortunate to play so safe that we lost opportunities to explore and better
understand the natural environment, human societies, health and well-being. All of us must make our
own judgements about the balance of risk and reward. But for me, the use of the great bulk of non-personal information is not contentious and the benefits are often likely to be significant. Even in
regard to personal information, I believe that we should exploit the benefits of its use for the public
benefit, using guidance, control mechanisms and safeguards such as those extolled above. I hope you
agree.
Acknowledgements
Thanks are due to those who made suggestions on the contents of a draft of this paper - Ed
Humpherson, Will Moy, Phil Sooben, Carol Tullo and Sharon Witherspoon. Any errors are the
author’s responsibility.
David Rhind - dwrhind@gmail.com
Chair, APPSI
April 2014