The Impact of Big Data on Social Research

David Rhind
Sharon Witherspoon

www.nuffieldfoundation.org

The landscape to be covered…
• What is Big Data? Just consultants’ hype?
• Key questions for SRA
• Technology + other drivers of change
• New sources of data and their uses
• Big challenges
• Back to the future – the next Census
• Presentation also matters
• Conclusions

What is/are Big Data?

VOLUME: too large to handle by standard contemporary analytical tools i.e. subjective / relative measure
“the total amount of data has grown exponentially: it has been estimated that more data was harvested between 2010 and 2012 than in all of preceding human history.”
Source: http://www.bbc.co.uk/news/business-17682304
Certainly made by Mike Lynch; original source IBM?

VELOCITY: how fast data is being produced and how fast it must be produced to meet demand.

VARIETY: many different forms of data which are used – structured and unstructured (the majority), held in different types of databases as text documents, emails, imagery, videos and much else

PROBLEMS: hype, bias in (large) sample, focus on correlations not causality, understanding the results

Context and key questions for SRA
• Current practice mostly survey-based
• Divide exists between expertise in data collection and analysis skills
• National shortfall in quantitative analytical skills
• Will Big Data, etc change the ground-rules of research practice?
• Are established practices becoming obsolete?
• Or do we need to assimilate what’s new into established principles of research?

Drivers of change
• Extraordinary rate of technological enhancement
• Austerity – better vfm sought
• Transparency
• Job creation/ increase wealth
• Calls for better/ more up to date data/info/evidence
• Threats to traditional approaches e.g.
EU Parliament and Data Protection - ‘Specific and explicit consent’

Public sector manifestations of change: data scientists sought by government, support of Open Data Institute, ONS exploration of options, data.gov, ESRC £64m funding & ADRCs

Technology change

The iPhone 4S (2012) in my pocket
• More computing power than Apollo 11 (1969)
• 3000 x storage of the IBM 305 disk drive
• $150 / year

IBM 305 disk drive (1956)
• Leased for $35,000/year

New(ish) sources of data
• Mobile phone sensors
• Proxy: satellite remote sensing 31cm resolution (how to reflect people data?)
• Proxy: web scraping (e.g. inflation measures)
• Crowd sourcing e.g. OpenStreetMap
• Management/ administrative data (public and private sector)
• Modelling starting from historic data

Visitors and locals in Paris
Source: Eric Fischer

Uses of different data types
• Obtaining data about ‘things’ easy? – see remote sensing examples
• People:
    • location and movement of people technically easy via CCTVs, smartphones
    • ethnicity, age data approximations from names
    • profiles from private sector data or linked governmental administrative data technically easy

Best solution usually is a combination of data types, e.g.
land cover and use from imagery and company records

Real time data collection now routine for some applications
Source: UK MoD under the Open Government license, Google and US Geological Survey

Different uses of imagery at different resolutions
• 10m resolution: see roads and water features
• 1 to 2 metres resolution: see some cars and individual houses
• 30 to 60cm resolution: see all visible cars, manholes
Source: DigitalGlobe 2014

Extreme crowd sourcing: Pyongyang OpenStreetMap
Also MH 370
Source: UK MoD under the Open Government license, Google and US Geological Survey

Admin data / management information
• Obvious advantages – already exists, often continuously maintained, linkage of personal admin data facilitates valuable research and fraud reduction
BUT
• You get (at best) what is created for other purposes
• Content or classification changes mess up time series
• Personal admin data sharing and privacy debate…
• Has raw data quality been audited properly (English police recorded crime statistics)?

Ratio between CSEW incidents and crime recorded by the police

Adding value = a commercial asset
Can have huge value e.g. Climate Corporation:
• 2006 start-up by 2 ex-Google staff
• Linked US government weather, crop yield and soil data
• Provide yield forecasting and planting advice, weather and crop insurance
• Bought by Monsanto October 2013 for $930m

Big Challenges
• Trade-off between data integrity and currency. How good is ‘good enough’? How fast is fast enough?
• Want to anticipate the future as well as know the past
• Private sector increasingly active in data collection and exploitation e.g. Markit surveys used by Bank of England. Internationalisation of data collection/assembly growing.
• Public understanding: problem with use of technical language e.g. public doesn’t really understand ‘n year flood’ concept. PM confusion of deficit and debt. Changed role of data constructor/statistician? – mentors and advocates?
• Is this all a matter for the very young?

Back to the future with surveys?

The 2011 Census
• 2011 Census survey data collection went well but total cost was £480m
• Basically very similar to what has been done for decades; 16% completed on-line
• Results started to become available 15 months after the survey but much was still being published after 3.5 years
• A changing society makes it more difficult to get forms completed
• Statistics Commission, Treasury Select Committee and UKSA said ‘no more traditional census’

LFS Response Rates 1993 to 2008
Source: ONS
US experience is similar – an average of 20% reduction in 20 years

The 2021 ‘Census’
• Very strong support from public consultation for continuation of some form of Census
• ONS plan now accepted in principle by government
• Model is for an on-line Census+:
    • aim to achieve a high rate (e.g. 65%) of online completion of forms
    • aim to enrich census data by adding variables derived from admin data wherever possible
    • much research under way…
• US Bureau of Census experimenting with use of smartphone-derived data

Source: ONS

Data presentation also matters!

Basic arithmetical error – it should be “almost £400” not “almost £4000”!

PM confusing deficit and debt…

National Infrastructure Plan: Pipeline value by sector (£m)
[Chart: capital value (£ million) by sector: Communications, Flood, Transport, Water]
Moral: how information is presented can seriously mislead (note log scale on Chart 2)

Conclusions
• Much Big Data hype but a revolution is under way
• This will change the way we assemble data and do social science to extract added value
• Much more work will be by multi-disciplinary teams with higher level analytic, quantitative and presentational skills in various disciplines
• Greater focus still needed on data quality issues
• Need focus on data sharing governance, ethics and safeguards and on advocacy of benefits
• Q-Step will help – a BIT.
But organisations like the SRA and its members have an important role!

Thank you