Drowning in data: who and what can we trust?[1]

David Rhind, Chair of APPSI

Summary

This paper takes a wide view of the issues arising from the advent of 'Big Data'. It distinguishes information from data and separates data into three groups based on custodianship, because these have different implications for data exploitation. The three categories are Open Data (and the National Information Infrastructure), personal data held by government, and data held as a commercial asset. The paper then addresses five issues: threats to privacy and the trade-offs involved with public benefits, the appropriate role of the state, the consequences of technology failure, the dearth of UK quantitative skills, and misunderstandings or misrepresentations of data. Central to minimising threats to privacy and maintaining public trust are good governance, regulation and independent 'fact checking'. Finally, guidance is given on how to assess evidence or conclusions derived from data and information.

1. From data famine to information feast?

In 1798 Thomas Malthus predicted that population growth would outstrip the availability of food, with catastrophic consequences. But, 216 years later, Malthus' apocalyptic prediction has not yet been realised (though nearly one in eight of the world's population suffer from chronic undernourishment[2]). The science of crop breeding, new technologies and improved management practices have been the main contributors to averting the cataclysm. In a later parallel, librarians argued in 1944 that the growing volume of publications would outstrip their capacity to store it, based on the observation that US university libraries were then doubling in size every 16 years.

What has happened is that the data and information collected by governments, the private sector, the voluntary sector and individuals have shifted dramatically from being collected and held in analogue (e.g. paper) form to being stored digitally. In 1986 around 99% of all information was stored in analogue form; by 2007 some 94% was stored in digital form[3]. Moreover, the total amount of data has grown exponentially: it has been estimated that more data was harvested between 2010 and 2012 than in all of preceding human history[4]. This increase owes much to new technologies of data collection - notably video recordings, plus sensors built into satellites, retail outlets, refrigerators, etc - and of processing and dissemination. Included are the results of many surveys reported daily in the media, some of which give minimal details of how the (often questionable) results were obtained. Much of this informs decisions and shapes attitudes and opinions. What of it can we trust, and how do we know?

2. The nature of data and information

In what follows it is helpful to know that data and information have particular characteristics.
These include:

- Conventionally we use a terminology based on a hierarchy of value rising from data to information to evidence to knowledge to wisdom. Data are numbers, text, symbols, etc that are in some sense value-free, e.g. a temperature measurement. Information is formally differentiated from data by implying some degree of selection, organisation and relevance to a particular use; multiple sets of information may be derived by selection from and analysis of a single data set. In practice, however, the distinction between data and information is often blurred in common usage: 'Open Data' and 'Public Sector Information' as used in government publications frequently overlap or are used synonymously.
- Even though the initial cost of collecting, quality assuring and documenting data may be very high[5], the marginal cost of providing an additional digital unit is close to zero. Use by one individual does not reduce availability to others (termed non-rivalry). In some cases individuals cannot be excluded from using the good or service (termed non-excludability).
- Linkage of data can generate many more possible uses. Thus if 20 different sets of information are collected for all of the Local Authorities (LAs) in Britain, there will be 190 possible pairs of information and over a million combinations in total for each LA (there are 2^20 - 1, or 1,048,575, non-empty combinations of 20 sets), some of which may reveal new knowledge - see the sketch after this list.
- Unlike fuel, data and information are not consumed by use. They may get out of date (like census information) but rise again in value in comparisons with new information.
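The arithmetic behind the linkage point can be checked directly. The following minimal sketch in Python, using the 20 data sets of the example above, reproduces the figures quoted:

    from math import comb

    n = 20                # information sets held for each Local Authority
    pairs = comb(n, 2)    # distinct pairwise linkages: "20 choose 2"
    subsets = 2 ** n - 1  # every non-empty combination of the 20 sets

    print(pairs)    # 190
    print(subsets)  # 1048575, i.e. "over a million combinations"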
3. Exploiting the data mountain

Mountains of data are of little value unless they can be transformed into information and used, often for a myriad of initially unforeseen purposes. The advent of the internet and then the world wide web in the 1980s and '90s has transformed human experience and business practice: it is now routine to find information lodged in distant lands, often in a few seconds, and to share it with others or analyse it, frequently in conjunction with other information. Moreover, it is now also routine to go beyond analysis of the 'what is' situation. 'What if' analyses and modelled projections and forecasts are becoming commonplace - such as the prediction by retailers like Amazon of what you are likely to buy, and even the locations of future crimes.

It is convenient for present purposes to view information as sub-divided into three parts based on custodianship: data created as part of the National Information Infrastructure (NII) and often made available as Open Data; personal data held by the state; and personal and other data held by commercial bodies. In practice, the categories are not distinct but lie on a spectrum.

3.1 Non-personal data as national infrastructure and Open Data

The importance of maintaining a sound national physical infrastructure - roads, railways, utilities and so on - is well recognised. Thus the interstate highways of the USA transformed commerce, employment and recreational travel from the 1950s on. In times of recession, many governments across the world see suitable 'shovel ready' infrastructure projects as a means of stimulating the economy and generating long term efficiencies. In essence, this amounts to building and exploiting a National Infrastructure[6].

What we are seeing now is the emergence of a new National Information Infrastructure[7]. This is not coherent - like Topsy, it has simply 'growed', in this case from legacy information needed by the state to carry out its roles or to generate revenue. There is a good case - which the UK government seems to have accepted - for now regarding this more strategically, by thinking about what information is really needed for the future and how technology could help provide it.

Until recently, governments have been the largest 'owners' of data about their people, their economies, and physical phenomena like the shape of the land surface, the weather and climate and the geography of transport networks. Some government actions in the information domain have been of immense public benefit far outside their national boundaries. The obvious example is the creation of the Global Positioning System (GPS) by the US government. It has underpinned a whole new global location services industry, valued in 2013 by the economic consultancy Oxera at between $150 and $270 billion per annum, and has prompted European and Russian competitor systems.

For governments, making the non-personal information they hold easily and publicly accessible has the potential to provide four positive outcomes: enhanced transparency and government accountability to the electorate, improved public services, better decision-making based on sound evidence, and enhanced competitiveness for the country in the global information business. All this is part of the Open Data agenda embraced enthusiastically by the current government and its predecessor. Some 13,000 government data sets are now accessible via data.gov.uk[8] and can be used for any purpose - personal, communal or commercial - under the Open Government Licence[9].
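As a concrete illustration of what 'accessible' means in practice, the catalogue can be queried programmatically as well as browsed. The sketch below assumes the CKAN catalogue software on which data.gov.uk is built; the endpoint path and response fields follow CKAN's standard conventions and may differ on the live service:

    # Minimal sketch: searching the data.gov.uk catalogue via its (assumed) CKAN API.
    import json
    import urllib.parse
    import urllib.request

    query = urllib.parse.urlencode({"q": "roadworks", "rows": 5})
    url = "https://data.gov.uk/api/action/package_search?" + query

    with urllib.request.urlopen(url) as response:
        payload = json.load(response)

    if payload.get("success"):
        result = payload["result"]
        print(result["count"], "matching datasets")
        for dataset in result["results"]:
            # Each catalogue entry carries a licence field, so reuse terms can be checked
            print(dataset["title"], "-", dataset.get("license_id"))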
The same agenda has been picked up by some local authorities and many other public bodies. It has also become an international growth industry, with the Open Data Charter[10] signed by the leaders of Canada, France, Germany, Italy, Japan, Russia, the UK and the USA at the G8 meeting in June 2013. This commits those nations to a set of principles ('Open by default', 'Usable by all', etc), to a set of best practices (including provision of metadata, i.e. data describing the main data set's classifications, etc) and to national action plans, with progress to be reported publicly and annually. Some 60 countries have also come together in the Open Government Partnership, underpinned by Open Data.

3.2 Personal data and public benefit

Much government data is collected about individuals and used for administrative purposes or for research which helps shape policy. Subsequently such data may be aggregated to produce information about geographical areas or sub-groups of the population - and increasingly this is then made available as Open Data. Following a report by Thomas and Walport[11], the UK government has recognised the utility of using such personal data[12] for purposes other than those for which they were originally intended, notably to reduce costs and provide more up to date statistical and other results (Section 4.1). As a result, it has funded through the Economic and Social Research Council (ESRC) a set of Administrative Data Research Centres (ADRCs) in each country of the UK and an ADR Network (ADRN). These are designed to enable carefully controlled access to a variety of detailed administrative data for academic research or government policy purposes. The ESRC-funded bodies are accountable to Parliament through the UK Statistics Authority.

Carefully controlled access within 'safe havens' to linked personal data can produce major public benefits. Many of these benefits are not available through analysis of aggregate data, such as those for large population (e.g. ethnic) groups or for populations in geographical areas. Potentially beneficial uses of such individualised data include:

- Reducing the burden on people or businesses of filling in multiple government questionnaires, with data collected for one purpose being spun off from databases originally collected for another.
- Ensuring that wherever an individual travels, his or her detailed health records could be available to any doctor in the event of a sudden illness or accident.
- Tracking the life histories and geographies of individuals to study correlations between, for example, exposure to environmental hazards and later illnesses.
- Knowing where people are in real time, and hence the location of potential victims in the event of natural disasters.
- Knowing the movements of crowds in real time from mobile phones, which can greatly help police to manage crowds safely.
- Studying inequality in society through analysis of the life experiences of people in small, specifically defined groups (e.g. population cohorts, local communities or ethnic groups), from which new policies and actions might flow.
- Reducing the incidence of fraud by merging multiple administrative datasets (e.g. tax records and social security benefits) and seeking anomalies.
- Profiling people on the basis of their background, personal characteristics, DNA or contacts for setting insurance premiums, marketing or credit-rating purposes - or, very controversially, to detect those predisposed toward acts of crime or terrorism.

The potential benefits of linkage and sharing of personal information are most obvious in health but exist in many other domains.
As indicated above, joining together information held in hospitals with that held by GPs about the same person, as proposed under the NHS Care.data plans, can enable more 'joined up' care and also facilitate research very likely to lead to long term public benefits. In part this is because the UK has a strong competitive advantage: the national coverage of the NHS means that the resulting large volumes of health data are more comparable than elsewhere. Yet the six month delay in the implementation of Care.data, triggered by antipathy from GPs and other parties[13] - despite such data linkage operating in Scotland - demonstrates the need for extraordinary efforts to secure public trust before launching schemes based on use of linked data about each and every person[14].

3.3 Data as a commercial asset

In some respects the government's primacy in data holdings has changed dramatically. Commercial enterprises now routinely collect and exploit vast amounts of information about their individual customers, notably through loyalty cards or through customer agreements (such as for loans). Profiling of individual customers helps target sales offers and, in many cases, increases favourable responses by a factor of ten. Google maps are now embedded in around 800,000 corporate web sites in Britain and have much private and public sector data linked to their geography. Substantial amounts of data are also collected through crowd-sourcing - such as the worldwide OpenStreetMap, which has over a million users and tens of thousands of volunteer contributors.

One consequence of all this is the commoditisation of some data. What was once a resource highly valuable because of limited access or the restrictive practices followed by governments is now increasingly a low cost commodity. This - and increasing competition in data supply, some of it global - has big implications for organisations trading in data. To survive they must add value, often by providing expert services using the data and 'co-mingling' it to provide a unique offering. A good example is the Climate Corporation[15], a US start-up founded in 2006 by two former Google data scientists. It has combined 30 years of weather data, 60 years of crop yield data, and 14 terabytes of soil data, all free from US government agencies. The services offered include yield forecasting, to help farmers make decisions such as where and when to plant crops in order to boost productivity, plus weather and crop insurance to help manage risk. The 'precision agriculture' firm was acquired by Monsanto, the world's largest seed company, for $930m last October. The success of the Climate Corporation seems to owe much to 'first mover' advantage rather than to guarding intellectual property rights - a contrast to many other information traders (including some government bodies).

3.4 The new information ecosystem

In summary, there has been the collection of unprecedented volumes of accessible data, the emergence of massively more powerful analytical tools to convert data into information, and the means to transfer the latter near-instantaneously. This is often encapsulated in the term 'Big Data'.
Technologically-driven enthusiasm has been misplaced before. In 1858 a Victorian commentator said: "It is impossible that old prejudices and hostilities should longer exist, while such an instrument has been created for the exchange of thought between all the nations of the Earth"[16]. He was referring to the laying of the first submarine trans-Atlantic telegraph cable. Yet if we ignore the 'hype' frequently associated with the Big Data term, these developments taken together constitute a new information ecosystem. This provides an opportunity to enhance how we operate for the public good and also to build commercial benefit.

There is increasing acceptance in the UK and beyond that Open Data - comprising non-personal data - is 'a good thing'[17] and hence to be welcomed. But the attitude to personal data about individuals is another matter, especially that held by government. The purposes for which such personal data are proposed to be used, the nature of the engagement with the affected individuals and the consent mechanisms in place can materially affect public trust in the data custodians and in the whole of government. I now address the main problems in securing the benefits of the new technologies and data.

4. But what about the problems?

The developments described above have already raised the following serious issues:

- Real or perceived threats to individual privacy, sometimes through pursuit of a wider public good.
- The appropriate role of the state.
- Consequences of technology failure.
- The relative dearth of quantitative skills in the population and its impact on the creation and use of evidence.
- The dangers posed by a combination of a data-savvy elite and gross inadvertent or other misrepresentations of scientific findings in the media or by politicians.

All this leads inexorably to the fundamental question of how the lay public knows who and what information to trust. I now consider each of the issues in turn before returning to this question.

4.1 Threats to individual privacy, sometimes through pursuit of a wider public good

Our personal right to privacy has in practice been sharply undermined since the attack on the Twin Towers in New York in 2001. This event, together with the new technologies described above, has led to a global increase in surveillance, capturing the movements, characteristics and conversations of people on a massive scale (as Snowden revealed). Understandably this has given rise to serious concerns about privacy infringements, the role of the state and the trade-off between privacy and public benefit.

We should distinguish between the different bases on which public and private sector bodies collect and hold personal information. That collected by government bodies is very often mandated by law: citizens have no right to refuse but benefit from government using the data in planning facilities or supplying social services (as well as in collecting income-based taxes). In contrast, individuals voluntarily - and often happily - provide personal information to many private sector bodies (e.g. retailers) for gain, such as discounts on purchases. There is some opposition to the consent mechanism, which normally consists of ticking a single box to sign up to lengthy contracts containing much small print. Yet this is entirely legal.
Attitudes to privacy may be context-dependent. For instance, parents concerned about the safety of their children might well contemplate giving them a smartphone with a location-reporting app. An injured hill walker or a wounded soldier would be happy for the rescue services to know their location. In these latter examples, an element of privacy is being traded off to ensure enhanced safety at some moment in time.

We should also note that what constitutes privacy, and what information is considered appropriate for the state to hold and use in combination, varies greatly between cultures. Nordic countries operate a series of linked registers to provide large amounts of administrative data about individuals for multiple government purposes. In general, three base registers exist containing details of each 'data unit': a population register, a business register (enterprises and establishments) and a register of addresses, buildings and dwellings. A unique identifier for each 'data unit' in the base registers, and links between them, is used. The unique identifier is also used in many other administrative registers for these 'data units', such as educational registers and taxation registers. An extract from the population register serves as a basic document that is needed when applying for a passport, getting married or divorced, or when a funeral is held or an estate is distributed.

The privacy issue is essentially about whether data concerning an identifiable individual - held by governments or private sector bodies - are improperly disclosed. The extreme approach to safeguarding such personal data is that proposed in 2013 by the European Parliament as the basis of a revised EU Data Protection Directive: that on every occasion where such a data set is to be used for a new purpose, permission for each record to be re-used should be obtained from the individual concerned. This would devastate research and societal benefit in the health domain and in many others. Central to any proper use of personal data by third parties is adherence to relevant laws or protocols, plus the application of sound methods which anonymise the records. The UK Anonymisation Network publishes best practice in anonymisation and offers practical advice and information to anyone who handles personal data and needs to share it[18].

It is widely recognised that the creation of a single unique identifier for each citizen, applied consistently across all datasets from the outset, has the potential to allow unauthorised re-identification and could compromise individuals' privacy. An approach has therefore been developed based on attaching unique identifiers to data shares on a case by case (or project by project) basis, where the identification algorithms and identifiers are destroyed once matching and linking have taken place. This gives assurance within each matched dataset that the data relate to the same data subject, but significantly reduces the risk of subsequently matching the data to other sources without the authority of the data controller. The approach provides strong controls on how data are used, and provides transparency for the citizen, so that they can hold data controllers and statisticians to account. A sketch of the idea follows.
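The following is a minimal sketch of such project-specific pseudonymisation. It illustrates the general technique rather than any particular centre's implementation, and the field name nhs_number and its value are invented for the example:

    # Each approved project receives a fresh secret key; records from different
    # sources carry a keyed hash of the person identifier instead of the
    # identifier itself.
    import hashlib
    import hmac
    import secrets

    project_key = secrets.token_bytes(32)  # generated per project, never reused

    def pseudonym(person_id: str, key: bytes) -> str:
        """Keyed hash: stable within one project, meaningless outside it."""
        return hmac.new(key, person_id.encode(), hashlib.sha256).hexdigest()

    gp_record = {"nhs_number": "9434765919", "prescriptions": 14}
    hospital_record = {"nhs_number": "9434765919", "admissions": 2}

    # Both sources apply the same project key, so their records link up...
    id_a = pseudonym(gp_record.pop("nhs_number"), project_key)
    id_b = pseudonym(hospital_record.pop("nhs_number"), project_key)
    assert id_a == id_b

    # ...and once matching is complete the key is destroyed, so the pseudonyms
    # cannot be regenerated or matched against other data sets.
    del project_key

Note that pseudonymisation alone is not anonymisation: attributes left in the records (age, postcode, occupation and so on) can still act as quasi-identifiers, which is why the governance controls described above remain essential.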
There are well-proven examples where successful and safe use has been made of data relating to individuals using such safeguards. For instance, the Scottish Longitudinal Study[19] (SLS) is a large-scale linkage study created using data from multiple administrative and statistical sources and benefiting from the integrated Scottish governmental structure. The data include detailed population census data from 1991 onwards; vital events data (births, deaths, marriages); NHS Central Register data (which gives information on migration into or out of Scotland); and education data (including Schools Census and other data). The SLS is a sample of 274,000 individuals from the Scottish population, selected using 20 random birthdates. All data are anonymised and strict procedures for use of the data are in place, with each proposed analysis requiring approval by ethics committees and oversight by senior individuals; that is, a robust form of governance is in place. The result is a series of valuable studies which could not otherwise have been carried out and which have led to important changes in policy and practice: the rates of blindness and amputation amongst diabetes sufferers in Scotland have, for instance, been dramatically reduced, in part by early identification of those at risk through data linkage. It is also relevant that no cases are known of any leakage of personal data from censuses in the last century (the entirety of the personal data for every individual is released under statute 100 years after it is collected).

The fundamental question is whether, despite all the techniques cited above, any anonymised personal data can be de-anonymised. As always, there are trade-offs: accepting some risk can increase the benefits. A major contribution to anonymisation is to remove all indicators of geography, yet this usually severely limits the value of the data. In addition, by using fuzzy logic techniques it may in some cases be possible to identify some individuals by bringing together two previously anonymised data sets. Experts have concluded that it is probably impossible to guarantee by technical means alone that anonymisation will always work for all time. However, in well designed and operated systems with good governance structures, access through 'safe havens', and criminal sanctions against misuse (see Section 5), such failures should be very rare indeed, whilst the potential social benefits could be very large. There is some evidence that members of the public understand the trade-offs and support data linkage[20].

It is worth noting that, once 'captured' in any governmental system (and some commercial ones), it is difficult to escape. Moreover, in many cases there will always be a record that you once existed on the system. There is no universal right to choose to be forgotten!

4.2 The appropriate role of the state

As indicated earlier, the nation state (and local governments) remain central to creating data and extracting much utility from derived information. But the role of the state matters beyond the handling of personal data. Here my focus is on commercial exploitation of information assets held by government. Most government bodies collect data only to meet their 'public task'. What they make available thereafter can be considered 'exhaust data', the great bulk of which is made available free or at marginal cost (usually near zero).
However, the UK operates a 'mixed data economy': there is a group of public bodies which are charged with covering their costs through commercialisation of data, or of services based upon the data they collect and hold. Of the 20 or so Trading Funds, Companies House, the Met Office, Land Registry and Ordnance Survey all generate very substantial proportions of their costs by such means (though all have been persuaded to make some of their information available as Open Data). Other bodies, such as the Environment Agency, charge for commercial use of their (e.g. flood) data. The most controversial example of the commercial/Open Data tension is Royal Mail, which created detailed data about all addresses across the UK whilst a publicly owned corporation. Royal Mail claimed that its privatisation was at risk if it did not retain ownership of, and exploitation rights to, this data, and politicians acquiesced - despite many arguments to the effect that this was a critical core data set, part of the National Information Infrastructure, and should be Open Data. Parallel considerations are (at the time of writing) under consultation in regard to the possible privatisation of Land Registry[21]. The Public Administration Select Committee has strongly criticised such sales of information assets[22], arguing that key information assets should be retained in public ownership. Leasing them out may be a much more appropriate approach.

But public ownership alone does not obviate other problems. For example, there is a real danger of quasi-monopolies enjoyed by government bodies exploiting their market position to squeeze out competition from the private sector through an array of techniques. This might very well lead, at the very least, to a reduction in public trust in such bodies, but it would also potentially damage the growth of competitors and thereby reduce benefits to the public. Robust regulation of Trading Fund activities, mainly by the Office of Public Sector Information and the Office of Fair Trading, is therefore of critical importance.

A further source of potential problems is the creation of commercial monopolies through outsourcing of tasks carried out previously within government. A number of cases have occurred where data collected to carry out the task was claimed to be owned by the contractor and hence not capable of being made available as Open Data. This has been recognised, and government is increasingly employing procurement contracts which obviate the issue. Indeed there are some signs of very positive approaches by a few commercial bodies in working with public bodies, notably by Elgin[23], which provides a national service on the location and other details of roadworks, assembled from disparate local data sets.

4.3 Causes and consequences of technology failure

We take the reliability of electronic devices, especially consumer items like smartphones and embedded Global Positioning System (GPS) receivers, as a given. Yet that is not necessarily valid. Modern technological societies are characterised by a complex interweave of dependencies and interdependencies among their critical infrastructures. Any significant failure, especially to electricity supplies, can have cascading consequences. Perhaps the biggest risks are in regard to civilian use of GPS. These arise from both human and natural hazards - and it matters: the European Commission estimated in 2011 that 6-7% of GDP in Western countries, i.e. €800 billion in the European Union, is dependent on satellite radio navigation.
Use of GPS for car guidance and maritime navigation (especially in busy areas like the English Channel) is ubiquitous; sudden loss or modification of the signals (e.g. by criminals) could be catastrophic. Moreover, a Royal Academy of Engineering report[24] pointed out that the time signals from the GPS system in particular have found application in managing data networks and mobile telephony. Civilian GPS signals may be readily blocked over small areas by devices easily purchased on the internet. North Korea is believed to use Russian devices to jam GPS signals over much wider areas of South Korea. The UK government and others are seeking to build greater resilience into these systems; that should be aided by the European Galileo navigation system, but the threat remains very serious[25].

In addition, the UK national civilian risk register[26] identifies space weather as a major threat. A variety of technologies are susceptible to the extremes of space weather - severe disturbances of the upper atmosphere and of the near-Earth space environment that are driven by the magnetic activity of the Sun. The main industries whose operations can be adversely affected by extreme space weather are the electric power, spacecraft, aircraft radio communications, and (again) GPS-based positioning industries. The greatest magnetic storm in historical times was the Carrington event of August-September 1859, when auroral displays were as bright as daylight over an area covering North and South America, Europe, Asia, and Australia. Magnetic observatories recorded disturbances in the Earth's field so extreme that magnetometer traces were driven off scale, and telegraph networks around the world - the "Victorian Internet" - experienced major disruptions and outages. Solar activity has been relatively low in the last 50 years but there is some evidence that such activity is growing. Consequential events are low frequency but, in extreme cases, may have very high impacts. Needless to say, much work is underway to publicise[27] and mitigate the risks.

4.4 The relative dearth of quantitative skills in the population and its impact on evidence

Effective and safe analysis of vast amounts of data requires quantitative training and sometimes advanced statistical skills. Equally, building new applications to enable such analysis and to create new industries and services requires quantitative skills. In so far as mathematics education is a good indicator of such abilities, the UK is bottom of a league table of 24 OECD countries in the proportion of 16 year olds who continue with any maths education[28]; the rapid development of Tech City in London has been facilitated by many immigrants trained outside the UK. It is now widely accepted that improved quantitative training is required in many subject domains, not just in mathematics education. The government has created a data capability strategy, an important part of which focuses on human capital development[29]. Various universities are creating Masters degree courses for data scientists.
The situation is particularly acute at undergraduate level and amongst those trained in the social sciences: work by the Economic and Social Research Council (ESRC), endorsed by the British Academy[30], has shown that qualitative UK social science in universities is of very high quality but that much needs to be done outside Economics in quantitative training. In response, the Nuffield Foundation, the ESRC and the Higher Education Funding Council have set up the £19.5m Q-Step initiative[31], which is funding at least 15 new university centres to bring enhanced and innovative training into social science undergraduate courses and to build links with schools. Meantime, the Royal Statistical Society's (RSS) GETSTATS programme has sought to broaden understanding within parliamentary, school and public audiences of the importance of acquiring such skills.

The dearth of appropriate skills impacts on our creation of evidence through data analysis and on the application of the findings. A common failure is the use of point estimates, instead of ranges, to justify major procurement decisions. One example is claiming that a PFI deal is good value for money because its Net Present Value (NPV) is £x million lower, over 30 years, than the NPV of a hypothetical public sector alternative - rather than explaining that cost projections over this period are uncertain and vulnerable to shifts in assumptions, and therefore require the presentation of ranges and sensitivities.
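To illustrate, the sketch below varies the assumptions behind a 30-year NPV comparison instead of fixing them at single 'best guesses'. All figures are invented; the point is that the resulting range can straddle zero, so the conclusion suggested by a point estimate may flip under equally plausible assumptions:

    # Minimal sketch: a 30-year NPV comparison reported as a range, not a point.
    import random

    def npv(annual_cost: float, rate: float, years: int = 30) -> float:
        return sum(annual_cost / (1 + rate) ** t for t in range(1, years + 1))

    random.seed(42)
    savings = []
    for _ in range(10_000):
        rate = random.uniform(0.02, 0.06)      # discount rate
        pfi_cost = random.uniform(95, 115)     # £m per year, PFI deal
        public_cost = random.uniform(90, 120)  # £m per year, public alternative
        savings.append(npv(public_cost, rate) - npv(pfi_cost, rate))

    savings.sort()
    low, median, high = savings[500], savings[5_000], savings[9_500]
    print(f"PFI saving (£m), 5th-95th percentile: {low:,.0f} to {high:,.0f}; median {median:,.0f}")
    # A point estimate would report only something like the median; the range
    # shows the 'saving' can plausibly be negative.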
A prime demonstration (of many) is the case of the re-franchising of the West Coast Main Line railway, as dissected by the National Audit Office[32]. The re-franchising was a major endeavour, with considerable complexity and uncertainty and a range of overlapping issues. In addition to managing the complex programme poorly, the Department for Transport relied heavily on technical analysis and modelling. Unfortunately there was a significant error in the tool that it used to calculate the subordinated loan facility. The result of all this was a catastrophe, with the Department cancelling the provisional (but by then public) letting of the contract to one bidder, and a substantial loss to the taxpayer.

4.5 Dangers of a data-savvy elite combined with gross misrepresentations of scientific findings by the media or politicians

Implicit in all of the above is a view that running the state, or parts of it, is best done when underpinned by evidence based on the best available information, quantitative whenever possible - and where the data and information behind the evidence are publicly accessible. This is the essence of Open Science as promulgated by the Royal Society[33]. If both the information and the methods by which it has been analysed are publicly available, then others can attempt to replicate the conclusions. Evidence-based policy founded on such open approaches is sometimes not easy to operate - external factors (like time pressures) may require decisions to be made on the best available previous analyses and less-than-ideal information. But the principle remains as a gold standard. In the private sector (e.g. in identifying good sites for supermarkets) commercial confidentiality is often a serious constraint on open operations.

Perhaps surprisingly, there may be a downside to the case made earlier for multiplying the numbers of data scientists. Whilst public trust in scientists has held up well[34] (except perhaps for those involved in the nuclear industry), the advent of a small group of experts setting the agenda based on analyses that only they understand is highly dangerous (as in the case of the banks before the financial crisis of 2008). Put simply, this argues for still greater efforts to exchange information, views and concerns through public debate between the lay public and specialists - rather than simply trying to gain public acceptance by circulating documentation written by experts.

Nowhere is the issue of small numbers of experts advising decision-makers or key influencers more important than in the House of Commons (cf. the House of Lords) and the media. The House of Commons (and, by extension, Ministers) contains very few trained scientists or quantitatively trained individuals. The same dearth exists amongst journalists, who are drawn mostly from arts backgrounds yet play a crucial role in informing the public. It is not surprising, therefore, that misinterpretations are made and that scathing comments are sometimes directed at those data analysts bearing bad news. One common manifestation of this is the focus on the most recent figures, such as measures of economic growth (e.g. GDP) or inflation (e.g. CPI). Thus we see headlines about a short term change of 0.1% in GDP (within the margin of measurement error) whilst the longer term trends are ignored. That said, some of the blame must lie with data producers and publishers; many cases occur of publications being written for technical elites, with little interpretation of the presented data. The UK Statistics Authority has been running a campaign to enhance explanation in statistical publications; an ONS Good Practice Team has been supporting government departments to this end. And the RSS has run many courses for journalists on the safe use of statistics.

Put very simply, sound policy-making needs as good an information base as we can assemble. But it also requires openness, clarity about the limitations of the evidence, reproducibility and even some humility on the part of the evidence providers; it equally requires preparedness to listen and critique, plus receptiveness, on the part of the policy makers or other decision-makers. People matter immensely, even in a world of dazzling technology and vast data.
5. Information governance, regulation and the contribution of civil society

A necessity in addressing many of the public concerns about use and misuse of data or information is to have sound and transparent governance arrangements in place, and for them to be widely known and tested. Good information governance applies to all handling of information, but particularly to personal information. An important early milestone was the 1997 Caldicott Report on information governance in the NHS. This established six principles which are also relevant to many situations outside the health service:

1. Justify the purpose(s). Every single proposed use or transfer of patient-identifiable information within or from an organisation should be clearly defined and scrutinised, with continuing uses regularly reviewed, by an appropriate guardian.
2. Don't use patient-identifiable information unless it is necessary.
3. Use the minimum necessary patient-identifiable information.
4. Access to patient-identifiable information should be on a strict need-to-know basis.
5. Everyone with access to patient-identifiable information should be aware of their responsibilities.
6. Understand and comply with the law.

Subsequently Dame Fiona Caldicott led a second review[35], which reported in 2013. This added a seventh principle: the duty to share information can be as important as the duty to protect patient confidentiality.

Beyond all that, good governance must include:

- Sound processes for vetting requests for access to personal data and for assessing the likely public benefits and the risks to privacy.
- Access to the data provided through 'safe havens' which have strong security controls.
- Safeguards on physical data storage based on international safety standards.
- Undertakings from the users about 'identifiability', with effective legal sanctions should these be broken (e.g. gaol sentences).
- Published and readily accessible governance standards, with those responsible for their oversight and implementation specified, and complaint and appeal mechanisms published.

Beyond these hugely important governance arrangements within organisations handling information, appropriate independent regulation needs to be in place. For all official statistics this is provided under statute by the UK Statistics Authority. The Office of Public Sector Information (OPSI) plays an important role in regard to the use of public sector information, not least in supervising adherence to UK legislation arising from the relevant EU Directives. An array of other regulators (e.g. within the NHS, the Office of the Information Commissioner and the Office of Fair Trading) play important roles, though the complexity of the information-related regulation landscape makes for many difficulties.

In addition to these parts of the 'official infrastructure' of regulation, there are some civil society organisations which analyse and comment on public statements in the media and by politicians. A good example is Full Fact[36]. A plethora of comment is also produced daily via blogs. Some of this - such as that by Ben Goldacre on health matters - is invaluable. In totality these provide a valuable correction mechanism for erroneous or misleading information.

6. Conclusions

The most fundamental question I set out to address is this: how does the lay public know what information to trust and whom to trust?
All of the above is designed to underpin the answer to this question, though my answer will not guarantee success! In an ideal world, the lay public would have a much greater understanding of quantitative approaches, including uncertainty and risk. Uncertainty exists in various parts of any analysis or study, whether in the measurement process, in the consequences of inbuilt assumptions or in the selection of particular models. Achieving this is a long term aim; what follows deals more with the short term.

For the data scientist, sensitivity testing of all assumptions, and of the models built on them, is essential and must be considered in assessing and reporting results. Identification of hazards and risks, plus mitigation of the latter, is central to any application of evidence. Confirmation bias - more readily accepting results which support prior intuition - must be guarded against. The non-scientist manager will wish to ensure that all these steps have been taken. But other 'sense tests' may help to guide the non-expert. These include:

- Being sceptical but willing to be convinced.
- Triangulating information from multiple sources to assess consistency and understand the reasons for any differences. Thus the varied but broadly consistent responses from multiple central government departments to the January 2014 public consultation on the importance and uses of Census information, alongside that from the Local Government Association, were very convincing.
- Relying on organisations that others trust and whose public statements are subject to scrutiny by other experts (such as the Institute for Fiscal Studies and Full Fact) - preferably those that use plain English and have a long and good track record.
- Placing greater reliance on data and information from long-established, successful organisations which have much to lose if their reputation is sullied (but beware of monopolies).
- Other things being equal, placing reliance upon bodies which are relatively independent, having their own endowments and operating at arm's length from governments or other funders.
- Requiring that any individual or organisation producing evidence can explain the sources and characteristics of the data or information used, how it was derived, the quality assurance processes applied and the main sources of uncertainty (i.e. the metadata).
- Checking that appropriate information governance arrangements, suitably comprehensive and publicly accessible, are in place. Is the organisation making questionable statements regulated, and is there redress for any failures?

As indicated in Section 5, anyone holding management responsibility within an information-producing and -consuming organisation (i.e. all of them) must ensure not only that good governance processes operate, but also that horizon scanning for new developments and risks relevant to their organisation is carried out by suitably tasked staff.

In a world of uncertainty and 'hype' arising from revolutionary technological developments, scepticism is a sound approach. But these developments affect all of us whether we wish it or not. It would be singularly unfortunate to play so safe that we lost opportunities to explore and better understand the natural environment, human societies, health and well-being. All of us must make our own judgements about the balance of risk and reward. But for me, the use of the great bulk of non-personal information is not contentious and the benefits are often likely to be significant.
Even in regard to personal information, I believe that we should exploit its use for the public benefit, applying guidance, control mechanisms and safeguards such as those extolled above. I hope you agree.

Acknowledgements

Thanks are due to those who made suggestions on the contents of a draft of this paper: Ed Humpherson, Will Moy, Phil Sooben, Carol Tullo and Sharon Witherspoon. Any errors are the author's responsibility.

David Rhind - dwrhind@gmail.com
Chair, APPSI
April 2014

Notes

[1] This paper formed the basis of the University of Bath Founder's Day Lecture, given by the author on 2 April 2014. Neither the University of Bath nor the Government's Advisory Panel on Public Sector Information (APPSI) bears any responsibility for the views expressed.
[2] UN Food and Agriculture Organisation estimate for 2010-2012.
[3] Hilbert, M and López, P (2011) 'The world's technological capacity to store, communicate, and compute information', Science 332(6025), 60-65.
[4] This claim is normally attributed to IBM, but see http://www.bbc.co.uk/news/business-17682304. Much of the data is user-collected (e.g. videos and photographs, emails, and social media traffic including tweets).
[5] The 2011 UK Census of Population cost approximately £480m to plan, collect and process (one of the lowest per capita costs in the world for such surveys) but data dissemination costs were minimal.
[6] https://www.gov.uk/government/collections/national-infrastructure-plan
[7] Promulgated by the Advisory Panel on Public Sector Information (APPSI) since 2010, this has now been widely adopted as a concept, e.g. https://www.gov.uk/government/publications/national-information-infrastructure
[8] Though a substantial fraction were previously available from a multiplicity of government web sites.
[9] Measuring the real success of the initiative is difficult, as the Public Administration Select Committee has pointed out: http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/564/564.pdf
[10] https://www.gov.uk/government/publications/open-data-charter/g8-open-data-charter-and-technical-annex
[11] http://webarchive.nationalarchives.gov.uk/+/http:/www.justice.gov.uk/docs/data-sharing-review.pdf
[12] Personal data, as defined by the Data Protection Act 1998, is data that relates to living individuals who are or can be identified from the data. Organisations that want or need to share, disseminate or publish their data for secondary use are obliged under the DPA (1998), unless exempt, to transform the data in such a way as to render it anonymous and therefore no longer personal (see http://ukanon.net/key-information/).
[13] See, for example, http://www.telegraph.co.uk/health/healthnews/10656893/Hospital-records-of-all-NHS-patients-sold-to-insurers.html, http://www.theguardian.com/society/2014/feb/21/nhs-plan-share-medical-data-save-lives and http://www.theguardian.com/commentisfree/2014/feb/28/care-data-is-in-chaos
[14] http://www.hscic.gov.uk/media/13726/Kingsley-Manning-speechHC2014/pdf/201314_HSC2014_FINAL.pdf
[15] David Kesmodel, 'Monsanto to buy Climate Corp. for $930 million', Wall Street Journal, October 2, 2013.
[16] Cited in The Economist, 17 August 2000.
[17] But see the Public Administration Select Committee report: http://www.publications.parliament.uk/pa/cm201314/cmselect/cmpubadm/564/564.pdf
[18] http://ukanon.net/
[19] http://sls.lscs.ac.uk/about/
[20] http://www.scotland.gov.uk/Publications/2012/08/9455/0 and http://www.ipsos-mori.com/DownloadPublication/1652_sri-dialogue-on-data-2014.pdf
[21] https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/274493/bis-14-510-introduction-of-a-land-registry-service-delivery-company-consultation.pdf
[22] http://www.parliament.uk/business/committees/committees-a-z/commons-select/public-administration-select-committee/news/open-data-substantive/
[23] http://www.elgin.org.uk/ and http://roadworks.org/
[24] http://www.raeng.org.uk/news/publications/list/reports/RAoE_Global_Navigation_Systems_Report.pdf
[25] http://www.ft.com/cms/s/0/fadf1714-940d-11e3-bf0c-00144feab7de.html?siteedition=uk#axzz2tICPGjzG
[26] http://www.foundation.org.uk/Events/AudioPdf.aspx?s=1278&f=20130206_Beddington.pdf
[27] National Academies Press (2008) Severe Space Weather Events: Understanding Societal and Economic Impacts. A Workshop Report; and Lloyd's of London (2012) Space weather: its impact on Earth and implications for business.
[28] http://www.nuffieldfoundation.org/uk-outlier-upper-secondary-maths-education
[29] https://www.gov.uk/government/publications/uk-data-capability-strategy
[30] http://www.britac.ac.uk/policy/Society_Counts.cfm
[31] http://www.nuffieldfoundation.org/q-step
[32] http://www.nao.org.uk/wp-content/uploads/2012/12/1213796es.pdf
[33] https://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf
[34] http://www.ipsos-mori.com/Assets/Docs/Polls/pas-2014-main-report.pdf
[35] https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/192572/2900774_InfoGovernance_accv2.pdf
[36] https://fullfact.org/